Crop Science Journal of Natural Resources and Life Sciences Education
Published online 24 February 2006
Published in Crop Sci 46:854-864 (2006)
© 2006 Crop Science Society of America
677 S. Segoe Rd., Madison, WI 53711 USA

PLANT GENETIC RESOURCES

Sampling Strategies for Conserving Maize Diversity When Forming Core Subsets Using Genetic Markers

Jorge Franco^a, José Crossa^b,*, Marilyn L. Warburton^b, and Suketoshi Taba^b

a Facultad de Agronomía, Universidad de la República, Av. Garzón 780 CP 12900. Montevideo, Uruguay
b International Maize and Wheat Improvement Center (CIMMYT), Apdo. Postal 6-641, 06600, Mexico D.F., Mexico

* Corresponding author (j.crossa{at}cgiar.org)


    ABSTRACT
Core subsets can be formed on the basis of molecular markers and different sampling strategies. This research used genetic markers on three maize data sets for studying 24 stratified sampling strategies to investigate which strategy conserved the most diversity in the core subset as compared with the original sample. The strategies were formed by combining three factors: (i) two clustering methods (UPGMA and Ward), based on (ii) two initial genetic distance measures, and using (iii) six allocation criteria [two based on the size of the cluster and four based on maximizing distances in the core (the D method) used with four diversity indices]. The objectives were (i) to study the influence of these factors and their interaction on the diversity of the core subsets and (ii) to compare the 24 stratified sampling strategies with the M strategy implemented in the MSTRAT algorithm. Success of each strategy was measured on the basis of maximizing genetic distances (Modified Rogers and Cavalli-Sforza and Edwards distances) and genetic diversity indices (Shannon index, proportion of heterozygous loci, and number of effective alleles) in each core. Twenty independent stratified random samples were obtained for each strategy using a sampling intensity of 20% of the collection. For the three data sets, the UPGMA with D allocation methods produced core subsets with significantly more diversity than the other methods and were better than the M strategy for maximizing genetic distance. For most of the diversity indices, the M strategy outperformed the D method.

Abbreviations: MR, Modified Rogers • CE, Cavalli-Sforza and Edwards • SH, Shannon diversity index • HE, proportion of heterozygous loci per individual • NE, the number of effective alleles • PN, proportion of non-informative alleles in the sample


    INTRODUCTION
GENETIC RESOURCES stored in germplasm banks are sampled for regeneration, characterization, evaluation, studying phenotypic and genotypic diversity, forming core subsets, and eliminating redundant and duplicate accessions. When these activities are performed, it is of paramount importance to preserve in the sample as much of the diversity present in the original collection as possible (Crossa et al., 1995a). Core subsets are assembled to facilitate the intensive study, evaluation, and utilization of genetic resources stored in germplasm collections, but this implies a substantial reduction in the number of accessions compared with the initial collection and thus a possible reduction in the genetic and phenotypic diversity compared with what exists in the original collection (Frankel and Brown, 1984; Brown, 1989). Core subsets can be formed on the basis of morphological, phenotypic, or molecular marker data. The latter would reflect changes that have occurred at the DNA level but are not necessarily expressed in the phenotype of the organism.

A sampling strategy involves defining a sampling intensity, a sampling method, and an allocation method (Thompson, 2002). The sampling intensity defines the sample size as a proportion of the population size, and for core subsets, several authors suggest intensities that range from 5 to 20% of the total number of accessions (Brown, 1989; Schoen and Brown, 1993; Brown and Spillane, 1999; van Hintum, 1999; van Hintum et al., 2000). Regarding sampling methods, researchers have recommended stratified sampling strategies (as opposed to simple random sampling) for managing genetic resources and forming core subsets (Peeters and Martinelli, 1989; Crossa et al., 1994, 1995a, 1995b; Spagnoletti Zeuli and Qualset, 1993; Charmet and Balfourier, 1995; Rincon et al., 1996; Chandra et al., 2002; Franco et al., 2003). These stratified sampling strategies suggest first classifying or clustering the genotypes on the basis of prior knowledge such as origin, passport data, or a numerical classification, followed by an allocation of accessions from each cluster to the subset. Franco et al. (1998, 1999, 2002, 2003) and Franco and Crossa (2002) propose a sequential clustering strategy for forming core subsets using discrete and continuous morphological data simultaneously. This classification approach has been extensively used for forming core subsets in tropical maize, Zea mays L. (Taba et al., 1998, 1999, 2001).

There are many classification methods and many measurements of distances between individuals or populations. The most popular clustering methods include the unweighted pair-group method using arithmetic average, UPGMA (Sokal and Michener, 1958), and the Ward minimum variance within groups (Ward, 1963). Distances calculated between individuals or populations based on phenotypic data are generally the Euclidean distance for continuous variables and Gower's (1971) distance for mixtures of continuous and discrete variables. Two types of distances are used with molecular marker data, depending on the marker type: (i) non-informative distances (e.g., Simple Matching, Jaccard), which use a table of absence or presence (0s and 1s) of each marker for each individual or population, and (ii) informative, or genetic, distances (e.g., Modified Rogers, Cavalli-Sforza and Edwards), which use information on loci and alleles within loci for each individual or population.

When choosing a core subset using stratified random sampling, an allocation method determines the number of accessions to be selected from each cluster. For core subsets, Brown (1989) described three allocation methods whose sample sizes do not depend on the diversity of the clusters: (i) constant (or fixed) across clusters, (ii) proportional to the cluster size (P method), and (iii) proportional to the logarithm of the cluster size (L method). Franco et al. (2005), working with mixtures of phenotypic variables, proposed using Gower's distance between accessions within each cluster (D method) as the allocation criterion. The authors found that the D method used with Gower's distance was the best of the allocation strategies they compared.

Schoen and Brown (1993) addressed the issue of how to use genetic markers to sample collections of wild crops while maximizing allele richness. They proposed the M (maximization) strategy that maximizes the number of observed alleles at each marker locus. The allele richness of a core subset formed by the M strategy is defined in terms of the number of allele classes represented in the sample. Another strategy that uses genetic markers to form subsamples is the H strategy, which seeks to maximize the number of alleles in the core subset by sampling accessions from groups in proportion to their within cluster genetic diversity. Bataillon et al. (1996) used computer simulation for comparing the retention of neutral alleles when forming core collections using nonmarker-based random sampling and stratified random sampling strategies versus the M strategy using genetic markers. The M strategy was more effective for retaining widespread and low frequency neutral alleles than the other sampling strategies. The MSTRAT algorithm developed by Gouesnard et al. (2001) implements the M strategy for selecting accessions that increase the number of allele classes. McKhann et al. (2004) formed core collections from 265 accessions of Arabidopsis thaliana (L.) Heynh. using MSTRAT and single nucleotide polymorphisms (SNPs) from a limited number of DNA fragments.

Marita et al. (2000) developed a computer algorithm that selects accessions for the core by ranking the genetic distance between each accession and the mean of all others. They suggest that core subsets can either be formed to include rare and localized alleles, which maximizes the total allele diversity in the core (as favored by taxonomists and geneticists), or be constructed from widely adapted accessions that maximize the representativeness of the genetic diversity in the core (the breeder's perspective). It can be deduced that the breeder's perspective might not maximize the total allele diversity of the core, especially for traits not used to form the core.

Different classification and allocation methods were used by Franco et al. (2005) to form core subsets, employing continuous and categorical phenotypic variables. How the D method proposed by Franco et al. (2005) will perform with genetic distances or allele diversity calculated with molecular markers, and how these will compare with the M strategy, has not been tested.

In this study, 24 sampling allocation strategies were created from the combination of the following three factors: (i) two clustering methods (UPGMA and Ward), (ii) two initial genetic distance measures (Modified Rogers, or Cavalli-Sforza and Edwards), and (iii) six allocation criteria [two of them based on the size of the cluster and four based on the D method of Franco et al. (2005) used with two genetic distances and with three allele diversity indices]. The first objective of this research was to study the influence of these three factors and their interaction on the diversity of the core subsets formed. The second objective of the research was to compare these 24 sampling strategies with the M strategy implemented in the MSTRAT algorithm. Twenty independent stratified random samples were obtained from the molecular markers of three maize data sets, using one sampling intensity (20%) with the purpose of evaluating the different factors and allocation methods affecting the diversity of the core subsets.


    MATERIALS AND METHODS
Data Sets
Three data sets were used in this study: (i) the "bulk data set" obtained from fingerprinting 275 bulks (populations represented by two bulks of 15 genotypes each) of maize landrace populations from the Americas and Europe, using 24 SSR markers with at least one SSR per chromosome and a total of 186 alleles (Dubreuil et al., 2006); (ii) the "accession data set" obtained from fingerprinting 521 maize individuals from 25 maize populations (including 24 different landraces collected in different geographical regions of Mexico and one breeding population generated from a mixture of landraces), using 26 SSR markers with at least one SSR per chromosome and a total of 209 alleles (Warburton et al., 2004); and (iii) the "populations data set," obtained from the accession data set by grouping the individuals of each population, and calculating the allele frequency per population; this data set had a total of 25 populations and 209 alleles (Warburton et al., 2004). The bulk data and accession data sets only have seven SSR markers in common.

The values of the bulk data set are frequencies obtained with Genotyper 2.1 (PerkinElmer/Applied Biosystems) and the MSSC program, a SAS/IML code written by Dubreuil et al. (2003); allele frequency values fall within the interval [0, 1], and this data set had 1.5% missing values. The frequency values for the accession data set are 0.0, 0.5, or 1.0; these frequencies correspond to each codominant allele of the diploid individuals, and no missing values were observed. The frequency values of the population data set are within the [0, 1] interval; no missing values were observed.

Cluster Analyses
The Ward (1963) and the UPGMA (Sokal and Michener, 1958) procedures for clustering observations are hierarchical techniques well described in Kaufman and Rousseeuw (1990). Both clustering strategies can be calculated from different initial matrices of distance between genotypes, such as the Modified Rogers (MR) and Cavalli-Sforza and Edwards (CE) genetic distance measurements. Both clustering methods were used in this study using the CLUSTER procedure of SAS (2000). The number of groups was determined using the pseudo-F (Calinski and Harabasz, 1974) and pseudo-t2 statistic related to the J(e) statistic (Duda and Hart, 1973). The pseudo-F and the J(e) statistics were found by Milligan and Cooper (1985) to be the best two criteria (out of 30) for defining the number of groups. We used two criteria because there is not a unique solution and the criteria can propose different solutions.
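As a rough illustration of how the two hierarchical methods differ, the sketch below merges clusters from a precomputed distance matrix using the standard Lance-Williams update formulas for average linkage (UPGMA) and for Ward's method. It is not the SAS CLUSTER procedure used in the study: the function name `agglomerate` is hypothetical, the number of clusters is passed in directly rather than chosen with the pseudo-F and pseudo-t2 statistics, and applying the Ward update to squared genetic distances is an assumption of this sketch.

```python
def agglomerate(dist, n_clusters, method="upgma"):
    """Merge the two closest clusters repeatedly until n_clusters remain.

    dist is a symmetric list-of-lists matrix of pairwise distances
    (for example MR or CE); method is "upgma" or "ward".
    """
    n = len(dist)
    # Assumption: the Ward update is applied to squared distances.
    power = 2 if method == "ward" else 1
    d = {(i, j): dist[i][j] ** power for i in range(n) for j in range(i + 1, n)}
    members = {i: [i] for i in range(n)}
    next_id = n
    while len(members) > n_clusters:
        (a, b), dab = min(d.items(), key=lambda kv: kv[1])
        na, nb = len(members[a]), len(members[b])
        updated = {}
        for k in members:
            if k in (a, b):
                continue
            dak = d[(min(a, k), max(a, k))]
            dbk = d[(min(b, k), max(b, k))]
            nk = len(members[k])
            if method == "ward":
                # Lance-Williams update for minimum-variance (Ward) linkage
                value = ((na + nk) * dak + (nb + nk) * dbk - nk * dab) / (na + nb + nk)
            else:
                # Lance-Williams update for average linkage (UPGMA)
                value = (na * dak + nb * dbk) / (na + nb)
            updated[(k, next_id)] = value  # k is always smaller than next_id
        # Drop every distance that involves one of the merged clusters
        d = {key: v for key, v in d.items() if a not in key and b not in key}
        d.update(updated)
        members[next_id] = members.pop(a) + members.pop(b)
        next_id += 1
    return sorted(sorted(m) for m in members.values())


# Hypothetical toy matrix: four genotypes forming two tight pairs
toy = [
    [0, 1, 10, 11],
    [1, 0, 9, 10],
    [10, 9, 0, 1],
    [11, 10, 1, 0],
]
groups_upgma = agglomerate(toy, 2, method="upgma")
groups_ward = agglomerate(toy, 2, method="ward")
```

On well-separated toy data the two methods agree; on real marker data they can produce different partitions, which is why clustering method enters the study as a factor.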

Distances and Diversity Indices
To test the success of each sampling allocation strategy in conserving diversity in the core as compared with the original collection, we used two genetic distances between pairs of genotypes and three diversity indices. Reif et al. (2005) gave mathematical and genetic details of the distance measures most commonly used with molecular data. The genetic distances were the Modified Rogers (MR) and the Cavalli-Sforza and Edwards (CE), and the three diversity indices were the Shannon diversity index (SH), the expected proportion of heterozygous loci (HE), and the number of effective alleles (NE) of each cluster and the whole collection. In addition, we used the proportion of non-informative alleles in the sample (PN). These six measures were defined as follows:

  1. The Modified Rogers distance (MR) between a pair of genotypes x, y ($MR_{xy}$) is $0 \le MR_{xy} = \frac{1}{\sqrt{2L}} \sqrt{\sum_{l=1}^{L} \sum_{a=1}^{n_l} (p_{xla} - p_{yla})^2} \le 1$, where $p_{xla}$ is the estimated frequency of allele a, within locus l, at genotype x; L is the number of loci (SSR markers); and $n_l$ is the number of alleles within the lth locus;
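  2. The Cavalli-Sforza and Edwards distance (CE) between a pair of genotypes x, y ($CE_{xy}$) is $0 \le CE_{xy} = \sqrt{\frac{1}{L} \sum_{l=1}^{L} \left(1 - \sum_{a=1}^{n_l} \sqrt{p_{xla}\, p_{yla}}\right)} \le 1$;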
  3. The Shannon diversity index (SH) of the entire sample is $SH = -\sum_{a=1}^{A} p_a \ln(p_a)$, where $A = \sum_{l=1}^{L} n_l$ is the total number of alleles in the sample, and $p_a$ is the frequency of the ath allele over the whole sample ($\sum_{a=1}^{A} p_a = 1$);
  4. The expected proportion of heterozygous loci per individual (HE) is $0 \le HE = \frac{1}{L} \sum_{l=1}^{L} \left(1 - \sum_{a=1}^{n_l} p_{la}^2\right) \le 1$. HE is a composite measure that summarizes genetic variation at the allele level (Berg and Hamrick, 1997), and it is computed as the mean of HE for each locus;
  5. The effective number of alleles (NE) is $1 \le NE = \frac{1}{L} \sum_{l=1}^{L} \frac{1}{\sum_{a=1}^{n_l} p_{la}^2}$, and it measures the number of alleles at a locus and the equality of the allele frequencies at that locus (Berg and Hamrick, 1997); and
  6. The proportion of non-informative alleles in the sample (PN) is an auxiliary variable measuring the proportion of alleles whose frequency is zero in every genotype selected in the core subset.
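The six measures can be sketched in a few lines of Python. This is an illustrative reading of the definitions above, not the authors' SAS-IML code. Several points are assumptions of the sketch: each genotype is stored as a list of loci, each locus being a list of allele frequencies that sum to 1; the CE distance is taken as the square root of the mean over loci of one minus the sum of square-rooted frequency products; NE is taken as the mean over loci of the inverse sum of squared frequencies; and for SH the per-locus frequencies are divided by L so that they sum to 1 over all A alleles. All function names are hypothetical.

```python
import math


def modified_rogers(x, y):
    """MR distance: sqrt of summed squared frequency differences over 2L."""
    ss = sum((px - py) ** 2 for lx, ly in zip(x, y) for px, py in zip(lx, ly))
    return math.sqrt(ss / (2 * len(x)))


def cavalli_sforza_edwards(x, y):
    """CE distance under the form assumed in the lead-in."""
    s = sum(1.0 - sum(math.sqrt(px * py) for px, py in zip(lx, ly))
            for lx, ly in zip(x, y))
    return math.sqrt(max(s, 0.0) / len(x))  # max() guards against rounding below 0


def mean_frequencies(sample):
    """Allele frequencies per locus, averaged over the genotypes in the sample."""
    n = len(sample)
    return [[sum(g[l][a] for g in sample) / n for a in range(len(sample[0][l]))]
            for l in range(len(sample[0]))]


def shannon(sample):
    freqs = mean_frequencies(sample)
    # Assumption: divide by L so the frequencies sum to 1 over all alleles
    p = [f / len(freqs) for locus in freqs for f in locus]
    return -sum(v * math.log(v) for v in p if v > 0)


def expected_heterozygosity(sample):
    freqs = mean_frequencies(sample)
    return sum(1 - sum(f * f for f in locus) for locus in freqs) / len(freqs)


def effective_alleles(sample):
    freqs = mean_frequencies(sample)
    return sum(1 / sum(f * f for f in locus) for locus in freqs) / len(freqs)


def prop_noninformative(sample):
    """Share of alleles whose frequency is zero in every sampled genotype."""
    flat = [f for locus in mean_frequencies(sample) for f in locus]
    return sum(1 for f in flat if f == 0) / len(flat)


# Two hypothetical diploid genotypes scored at two loci with two alleles each
x = [[1.0, 0.0], [0.5, 0.5]]
y = [[0.0, 1.0], [0.5, 0.5]]
mr_xy = modified_rogers(x, y)
ce_xy = cavalli_sforza_edwards(x, y)
core = [x, y]
```

In this toy case the two genotypes share no allele at the first locus and are identical at the second, so both distances fall midway on their squared scale.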

Allocation Methods
P and L Allocation Methods
These allocation methods are implemented on the basis of the size of the clusters (but not on their diversity). The P allocation method uses the size of the tth cluster ($N_t$) to obtain the sample size of the tth cluster ($n_t$): $n_t = n \times \frac{N_t}{\sum_{j=1}^{g} N_j}$, where n is the total sample size (10 or 20%) and g is the number of clusters. The L allocation method proposed by Brown (1989) uses the logarithm of the size of the tth cluster ($N_t$) to obtain the sample size of the tth cluster ($n_t$): $n_t = n \times \frac{\log(N_t)}{\sum_{j=1}^{g} \log(N_j)}$, where n is the total sample size (10 or 20%).
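A minimal sketch of the two size-based rules follows. The function names and the cluster sizes are hypothetical, and the results are left as real numbers because the text does not state how fractional sample sizes were rounded.

```python
import math


def allocate_proportional(cluster_sizes, n):
    """P method: each cluster's share is proportional to its size."""
    total = sum(cluster_sizes)
    return [n * size / total for size in cluster_sizes]


def allocate_logarithmic(cluster_sizes, n):
    """L method: each cluster's share is proportional to the log of its size."""
    logs = [math.log(size) for size in cluster_sizes]
    total = sum(logs)
    return [n * value / total for value in logs]


# Hypothetical collection of 200 accessions in four clusters, sampled at 20%
sizes = [100, 50, 25, 25]
n = 40
p_alloc = allocate_proportional(sizes, n)   # [20.0, 10.0, 5.0, 5.0]
l_alloc = allocate_logarithmic(sizes, n)
```

Compared with the P method, the L method shifts part of the sample from the largest cluster to the smaller ones. A cluster holding a single accession has a logarithm of zero and would receive no sample under the L rule as written here.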

D Allocation Method
The D allocation strategy was proposed by Franco et al. (2005) using, as D-criterion, the Gower distance for a mixture of discrete and continuous phenotypic traits. When used with molecular marker data, the D method determines the size of the sample to be drawn from each cluster, which should be proportional to a genetic distance or allele diversity measure within the cluster. Groups that are more diverse will have a larger mean genetic distance (or greater diversity) and therefore larger samples will be drawn from them.

For t = 1, 2, ..., g clusters, the number of accessions ($n_t$) to be drawn from the tth cluster is $n_t = n \times p_t = n \times \frac{\bar{d}_t}{\sum_{j=1}^{g} \bar{d}_j}$, where n is the total sample size to be drawn from the collection (which in this study will be 20% of the entire collection), $p_t$ is the proportion of the sample size to be drawn from the tth cluster, and $\bar{d}_t$ is the mean of the genetic distances between accessions within the tth cluster (MR or CE) or the value for the selected diversity index of this cluster (SH, HE, or NE).
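The D rule can be sketched as follows, assuming a precomputed matrix of pairwise genetic distances. The function names, the toy matrix, and the cluster memberships are hypothetical. The sketch neither rounds the shares nor caps them at the cluster size, since the text does not describe how either was handled.

```python
from itertools import combinations


def mean_within_distance(members, dist):
    """Mean pairwise distance among the accessions of one cluster."""
    pairs = list(combinations(members, 2))
    if not pairs:
        return 0.0  # a single-accession cluster has no within-cluster distance
    return sum(dist[i][j] for i, j in pairs) / len(pairs)


def allocate_by_diversity(cluster_diversity, n):
    """D method: each cluster's share is proportional to its diversity value."""
    total = sum(cluster_diversity)
    return [n * d / total for d in cluster_diversity]


# Hypothetical distance matrix for four accessions in two clusters
dist = [
    [0.0, 0.2, 0.9, 0.8],
    [0.2, 0.0, 0.7, 0.9],
    [0.9, 0.7, 0.0, 0.6],
    [0.8, 0.9, 0.6, 0.0],
]
clusters = [[0, 1], [2, 3]]
diversity = [mean_within_distance(c, dist) for c in clusters]
shares = allocate_by_diversity(diversity, n=2)
```

The same allocation function applies when the diversity value is a cluster's SH, HE, or NE instead of its mean distance. Because the shares depend only on the ratio of the diversity values, a small but diverse cluster can be sampled as heavily as a large uniform one.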

Strategies Using Stratified Sampling
To study the influence of three factors (forming 24 factorial combinations) on the diversity of the resulting core subset, the 24 sampling allocation strategies were tested one at a time and the diversity present in the resulting core subsets compared. To simplify the notation we assigned a number (1–24) to each of the 24 strategies as shown in Table 1.


Table 1. Codes for the 24 strategies obtained by the combination of two clustering methods, two initial distances between genotypes, and six allocation methods.

 
For the second objective, the 24 sampling allocation strategies were compared with the M strategy. As described by Gouesnard et al. (2001), the M strategy is an algorithm for selecting a subset of n accessions from the N accessions of the entire collection. The algorithm consists of (i) forming a subset of n accessions chosen at random from the N accessions of the whole collection, (ii) testing all possible subsets of size (n – 1) for allele diversity and retaining the subset with the highest richness, and (iii) adding, from the remaining accessions, the one that brings the greatest increase in the diversity criterion, which forms a new subset of size n. Steps (ii) and (iii) are repeated until the richness of the subset no longer improves. The diversity of the core subsets formed is measured using a score of allele richness. If core subsets have the same diversity score, the Nei or Shannon index is used as a criterion for breaking the tie.
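The three steps can be sketched as a simple exchange loop. This is an illustration of the description above, not the MSTRAT software: the function names and the toy data are hypothetical, richness is taken as the count of distinct alleles present in the subset, each accession is represented by the set of alleles it carries, and the Nei or Shannon tie-break is omitted.

```python
import random


def richness(subset, alleles):
    """Number of distinct alleles carried by at least one accession in the subset."""
    present = set()
    for acc in subset:
        present |= alleles[acc]
    return len(present)


def m_strategy(alleles, n, rng=None, max_iter=1000):
    """Exchange search: drop one accession, add the best replacement, repeat."""
    rng = rng or random.Random()
    accessions = list(alleles)
    core = rng.sample(accessions, n)                 # step (i): random start
    best = richness(core, alleles)
    for _ in range(max_iter):
        # step (ii): keep the richest subset of size n - 1
        drop = max(core, key=lambda a: richness([c for c in core if c != a], alleles))
        reduced = [c for c in core if c != drop]
        # step (iii): add the outside accession giving the largest richness
        outside = [a for a in accessions if a not in reduced]
        add = max(outside, key=lambda a: richness(reduced + [a], alleles))
        candidate = reduced + [add]
        score = richness(candidate, alleles)
        if score <= best:                            # stop when nothing improves
            break
        core, best = candidate, score
    return core, best


# Hypothetical collection: accession name -> set of allele labels it carries
toy_alleles = {"a": {1, 2}, "b": {1, 2}, "c": {3, 4}, "d": {5}, "e": {1}}
core, score = m_strategy(toy_alleles, n=2, rng=random.Random(0))
```

The loop is a local search, so on larger collections different random starts may end in different subsets, which is one reason for running the algorithm repeatedly.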

Alleles at a locus might be related to alleles at another locus because they share common ancestries or because the species has a certain mating system (e.g., self-fertilization) that favors gametophytic disequilibrium. The M strategy correlates the allele richness at a marker locus with the allelic richness at a target locus.

Independent Samples
Independent Stratified Random Samples
A preliminary study was conducted to determine the minimum number of independent stratified random samples (independent replicates or independent core subsets) required for detecting differences of less than 2% of the overall mean between factor levels at two sampling intensities, 10 and 20%. Five hundred independent repetitions (500 independent core subsets) for each of the 24 factorial combinations were used. Results showed that 20 independent stratified random samples (core subsets) for each of the 24 combinations of factors were sufficient to detect differences between factor levels of 2% or less of the overall mean, and a sample intensity of 20% always performs better than 10%.

The 20 independent stratified random samples from a population (collection) were generated with the SURVEYSELECT procedure of SAS (SAS Institute, 2000). Similarly, 20 independent core subsets were obtained for further comparison using the MSTRAT software (http://www.ensam.inra.fr/gap/resgen88; verified 1 December 2005), developed by Gouesnard et al. (2001) to apply the M strategy. The final numbers of iterations run on MSTRAT were 25, 55, and 105 for the population, bulk, and accession data sets, respectively. A higher number of iterations did not improve the values of the core subsets' diversity measurements.
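The stratified draw itself is simple to sketch once the clusters and the per-cluster sample sizes are known. The code below is a stand-in for the idea, not for the SURVEYSELECT procedure: the function names, the seed, and the toy clusters are hypothetical, and the allocation is assumed to be already rounded to whole numbers.

```python
import random


def stratified_sample(clusters, allocation, rng):
    """Draw allocation[t] accessions without replacement from cluster t."""
    core = []
    for members, k in zip(clusters, allocation):
        k = min(k, len(members))      # never request more than the cluster holds
        core.extend(rng.sample(members, k))
    return core


def independent_cores(clusters, allocation, replicates=20, seed=0):
    """Repeat the stratified draw to obtain independent core subsets."""
    rng = random.Random(seed)
    return [stratified_sample(clusters, allocation, rng) for _ in range(replicates)]


# Hypothetical collection of 15 accessions in two clusters, 20% sampled
toy_clusters = [list(range(10)), list(range(10, 15))]
cores = independent_cores(toy_clusters, allocation=[2, 1])
```

Each replicate keeps the same number of accessions per cluster and differs only in which accessions are drawn, so the spread among replicates reflects sampling variation within a strategy.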

Selection Criteria for the Best Core Subset
The criteria used for selecting the best core subsets are the same as those used for creating the core subsets. We used two criteria for selecting the best core subset: (i) the average genetic distances (MRs or CEs) between pairs of accessions of the core subset and (ii) the diversity indices (SHs, HEs, and NEs) for measuring the allele richness of each core subset. Thus, the best core subset is the one with (i) the highest average genetic distance between accessions (measured with MRs and CEs), (ii) the highest allele richness (as measured with SHs, HEs, and NEs), and (iii) the lowest proportion of non-informative alleles (PNs). These criteria are in agreement with Marita et al. (2000), who suggest that core subsets can be formed either by maximizing the total diversity, thus ensuring the inclusion of restricted or rare alleles (taxonomists' or geneticists' perspective), or by maximizing the representativeness of the genetic diversity in the core subset through the inclusion of "generalist" alleles (breeder's perspective). Values of MRs, CEs, SHs, HEs, NEs, and PNs were obtained from 20 independent replicates (core subsets) and subjected to statistical analyses where they were used as response variables.

Statistical Analyses
For evaluating the importance of the main effects of the three factors and their interaction (first objective), variance components and their contribution to the total variance were estimated by the VARCOMP procedure from SAS (2000) with the Restricted Maximum Likelihood Estimation (REML) option. Also, mean comparisons were performed for the significant sources of variation by the Tukey test at the 1% probability level. Twenty independent core subsets for each factorial combination were used and the response variables were MRs, CEs, SHs, HEs, NEs, and PNs.

For comparing the 24 sampling strategies with the M strategy (MSTRAT algorithm) (second objective), the 25 resulting treatments (with 20 independent replicates each) were compared for MRs, CEs, SHs, HEs, NEs, and PNs by the Dunnett test (P ≤ 0.01).

All procedures were implemented in SAS-IML software, the SAS-STAT module, and the MSTRAT software.


    RESULTS
Clustering and Sizes of the Groups
The clustering of all data sets was performed by the UPGMA and Ward strategies on two initial matrices of distance between genotypes, the MR and the CE distance matrices. Concerning the sizes of the clusters, as expected, the P and L allocation methods sampled more genotypes from the larger clusters, whereas the D methods sampled more genotypes from the clusters showing higher values for each allocation criterion (MR or CE, SH, HE, and NE). For example, for the bulk data set, the UPGMA-MR clustering formed clusters 1 and 3 with 186 and 16 accessions, respectively; the P method, proportional to the cluster size, chose 38 accessions from cluster 1 and 3 accessions from cluster 3. On the other hand, and according to the diversity of the cluster, the D method chose 13, 14, 15, and 15 accessions when used with the MR, SH, HE, and NE criteria from cluster 1 (with values of 0.389, 4.379, 0.605, and 2.845 for MR, SH, HE, and NE, respectively) and chose 16, 14, 15, and 15 accessions from cluster 3 (with values of 0.491, 4.329, 0.595, and 2.730 for MR, SH, HE, and NE, respectively; data not shown).

Effect of the Factors and Their Interactions on the Diversity of the Core Subset
Variance Components
For the bulk data set, the distance criteria MRs and CEs were mainly affected by the allocation method (ALLOC) (54 and 56% of the total variance, respectively) (Table 2) and the cluster method (CM) (28 and 15%, respectively). They were less affected by the cluster method × allocation method interaction (CM × ALLOC) (7 and 5%, respectively) and the cluster method × distance interaction (CM × DIST) (6 and 14%, respectively). For the selection criteria given by the diversity indices SHs, HEs, and NEs, the CM × DIST interaction was a relatively important source of variability (28, 14, and 32%, respectively). The variable PNs for the bulk data set did not seem to be highly affected by any of these sources of variability. A sizeable portion of the total variance was left unexplained for the selection criteria SHs, HEs, NEs, and PNs (58, 71, 59, and 88%, respectively) (Table 2).


Table 2. Percentage of the total variance explained among core diversities by different sources of variation measured in three data sets (bulk, accession, and population) using 20 independent replicates. Core diversity was measured with six selection criteria: two distances (Modified Rogers, MRs; and Cavalli-Sforza and Edwards, CEs), three diversity indices (Shannon, SHs; proportion of heterozygous loci per individual, HEs; number of effective alleles, NEs), and the proportion of non-informative alleles in the core (PNs).

 
For the accession data set, the main effect of the CM was the most important source of variability for all response variables (Table 2), followed by the CM × ALLOC interaction and the ALLOC main effect. The CM × DIST interaction affected mainly the diversity indices SHs and NEs (15 and 28%, respectively) and had less effect on the genetic distances MRs, CEs, and the diversity index HEs (3, 0, and 5%, respectively). A small portion of the total variance was left unexplained for MRs, CEs, SHs, HEs, and NEs (5, 4, 9, 9, and 11%, respectively). PNs was affected solely by the CM, and about half of its variance was left in the residual component (55%). For the accession data set, the variability of the six selection criteria was well explained by the main factors of CM and ALLOC and their interaction.

The variance components due to CM, DIST, ALLOC, and their interactions for the population data set were very similar to those found in the bulk data set. Values of MRs and CEs were mainly affected by the ALLOC (21 and 22% of the total variance, respectively, Table 2) and the CM (33 and 38%, respectively), and less affected by the CM × ALLOC interaction (4 and 3%, respectively) and the CM × DIST interaction (10 and 3%, respectively). For SHs, HEs, and NEs, most of the total variance was left unexplained.

In summary, for the three data sets, the distance selection criteria (MRs and CEs) were affected by clustering method, allocation method, and their interaction. On the other hand, the diversity selection criteria (SHs, HEs, and NEs) were affected only slightly by the clustering method in the bulk and population data sets but by all factors in the accession data set. The variance of the auxiliary selection variable PNs was explained principally by the residual component, showing that all factors produce a similar number of non-informative alleles in the core groups.

Mean Comparisons
For the bulk data set, Tukey's tests indicated that, on average, UPGMA formed core subsets that were more diverse than those formed by Ward for the two genetic distances (MRs and CEs) but less diverse for the three diversity indices (SHs, HEs, NEs) (Table 3). Allocation criteria that used the CE genetic distance formed more diverse core subsets than those that used the MR genetic distance for MRs, CEs, SHs, HEs, and NEs. The allocation method that produced the most diverse core subsets with respect to all six selection criteria was the D method with MR or CE, followed by the D method using the number of effective alleles (NE). This result suggests that for the bulk data, the D method is effective in selecting genotypes that are diverse in terms of genetic distances and have significantly higher allele richness.


Table 3. Means and Tukey's test† of the six selection criteria (MRs, CEs, SHs, HEs, NEs, PNs) for the core subsets formed by two cluster methods (CM) (UPGMA [U] and Ward [W]), two initial distance matrices (DIST) (Modified Rogers [MR] and Cavalli-Sforza and Edwards [CE]), and six allocation methods (ALLOC) (A, D, H, L, P, and S) for the bulk, accession, and population data sets, determined on the basis of 20 independent replicates. Values of the six selection criteria measured on the entire collection are also given (COLLECTION).

 
For the accession data set, core subsets formed by UPGMA were significantly more diverse than those formed by Ward for all the selection criteria except PNs (Table 3). The CE genetic distance formed more diverse clusters for all selection criteria except for MRs and PNs. The allocation methods A, D, and H were the best for all selection criteria, except PNs, indicating that the D method selects genetically diverse accessions (measured by either MRs or CEs) with high allele richness [measured by heterozygosity (HEs) and number of effective alleles (NEs)].

For the population data set, core subsets assembled by UPGMA were significantly more diverse than those formed by Ward for all the selection criteria except NEs (Table 3). CE formed more diverse clusters than those formed using MR for all selection criteria except for MRs and PNs. The D allocation method used with MR or CE was the best for all six selection criteria.

For all three data sets, most of the allocation strategies formed core subsets that are more diverse than the entire collection measured for the six selection criteria (Table 3). These results indicate that the primary objective of the allocation strategies, forming core subsets that preserve the diversity of the original collection by eliminating redundant accessions, was achieved.

In summary, results show that the most important trends were (i) the UPGMA cluster method was almost always better than Ward for the distances criteria, (ii) the CE genetic distance tended to be better than the MR for all three data sets, and (iii) for the three data sets the best allocation method was D used with MR, CE, or NE.

Comparing All Strategies with the M Strategy by the Diversity of Their Core Subsets
For each data set, the average diversity of the 20 independent core subsets formed for each of the 24 factor combinations was compared with the average diversity of 20 independent core subsets formed by the M strategy (Table 4 and Fig. 1–3).


Table 4. Summary of comparisons between core diversities when core subsets were formed using the best stratified strategy or the M strategy. Core diversities were measured using six selection criteria: Modified Rogers (MRs), Cavalli-Sforza and Edwards (CEs), Shannon diversity index (SHs), proportion of heterozygous loci per individual (HEs), number of effective alleles (NEs), and the proportion of non-informative alleles observed in the core (PNs). Comparisons were made for three data sets (bulk, accession, and population) and are based on the mean of 20 independent replicates.

 

Fig. 1. Plot of the difference between the value of each of the 24 strategies (Table 1) and the M strategy for six selection criteria, MRs (1a), CEs (1b), SHs (1c), HEs (1d), NEs (1e), and PNs (1f), for the bulk data set. Horizontal dashed lines indicate critical values for the Dunnett test (P ≤ 0.01) of the differences of each strategy above and below the M strategy from the mean of 20 replicates.

 

Fig. 2. Plot of the difference between the value of each of the 24 strategies (Table 1) and the M strategy for six selection criteria, MRs (2a), CEs (2b), SHs (2c), HEs (2d), NEs (2e), and PNs (2f), for the accession data set. Horizontal dashed lines indicate critical values for the Dunnett test (P ≤ 0.01) of the differences of each strategy above and below the M strategy from the mean of 20 replicates.

 

Fig. 3. Plot of the difference between the value of each of the 24 strategies (Table 1) and the M strategy for six selection criteria, MRs (3a), CEs (3b), SHs (3c), HEs (3d), NEs (3e), and PNs (3f), for the population data set. Horizontal dashed lines indicate critical values for the Dunnett test (P ≤ 0.01) of the differences of each strategy above and below the M strategy from the mean of 20 replicates.

 
For the bulk data set, strategies 15 and 17 through 24 (numbers according to Table 1) formed statistically more diverse core subsets than the M strategy for the MRs genetic distance selection criterion (Fig. 1a); the best strategy (24) was 5.4% better than the M strategy (Table 4) for MRs. With respect to the CEs selection criterion, only strategies 18, 23, and 24 were statistically better than the M strategy (Fig. 1b) (strategy 24 was 1.2% superior to the M strategy, Table 4). However, for the rest of the selection criteria, SHs, HEs, NEs, and PNs, the M strategy formed statistically more diverse core subsets (Fig. 1c–1f). Note that all 24 strategies obtained significantly more noninformative alleles (higher PNs) than the M strategy; the average of 20 core subsets formed by strategy 24 had 6.6% of PNs, whereas core subsets from the M strategy had 2.1% of PNs (Table 4).

For the accession data set, strategies 14 through 18 and 21 through 24 formed core subsets significantly more diverse than those formed by the M strategy for the MRs response variable (Fig. 2a). With regard to the CEs selection criterion, strategies 15–18 and 21–24 formed core subsets with the same diversity as the M strategy (Fig. 2b); the best strategy (24) was only 0.1% better than the M strategy (Table 4). The best strategy in terms of SHs was the M strategy (Fig. 2c). For the HEs selection criterion, strategies 23 and 24 formed core subsets with the same diversity as the M strategy (Fig. 2d), and for the NEs selection criterion, strategies 21, 23, and 24 formed core subsets with the same diversity as the M strategy (Fig. 2e). Core subsets from the best strategy (24) had, on average, 16.4% noninformative alleles, whereas core subsets formed from the M strategy had no noninformative alleles.

Results for the population data set were very similar for MRs and CEs. Strategies 15 through 18 and 20 through 24 formed significantly more diverse cores than the M strategy (Fig. 3a and 3b), with gains over the M strategy of 5.6 and 4.3%, respectively (Table 4). For the selection criteria SHs, NEs, and PNs, the M strategy formed statistically more diverse core subsets (Fig. 3c, 3e, 3f). For HEs, strategies 20 and 22 through 24 did not differ from the M strategy (Fig. 3d). The percentage of noninformative alleles was 26.4% for strategy 24 and 18.3% for the M strategy.

In summary, for criteria based on MRs, strategies 15, 17, 18, and 21–24 formed more diverse core subsets than the M strategy in the three data sets, whereas for CEs, strategies 18, 23, and 24 were better than the M strategy. For the response variables related to the diversity indices, SHs, HEs, NEs, and PNs, the M strategy was better than any of the D strategies, except for HEs in the accession and population data sets, where strategies 23 and 24 formed core subsets with diversity similar to those formed by the M strategy.


    DISCUSSION
This study used three data sets to find the most effective sampling strategy with the objective of forming the most diverse core subset. The results support the effectiveness of numerical clustering methods for classifying accessions into groups using genetic distances, followed by stratified sampling proportional to the diversity of the clusters. When a clustering method is used in conjunction with genetic distances between accessions and/or a diversity index within groups (D methods), its effectiveness in retaining diverse genotypes with significant allele richness is increased. As already found by Franco et al. (2005) when using phenotypic data, the core subsets assembled with an allocation strategy proportional to the size of the clusters (P method and L method) did not produce core subsets more diverse than those formed by the D method, which uses an allocation strategy proportional to the diversity within the clusters.
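The idea of allocating the core in proportion to within-cluster diversity, as described above, can be sketched as follows. This is only a generic proportional allocation with largest-remainder rounding; the function name, cluster labels, diversity scores, and cluster sizes are hypothetical, and the exact weighting used by the D method is the one defined in Materials and Methods.

```python
def allocate_by_diversity(diversity, sizes, core_size):
    """Split core_size among clusters in proportion to a diversity score.

    diversity maps cluster -> diversity score (for example, a mean
    within-cluster distance); sizes maps cluster -> number of accessions.
    """
    total = sum(diversity.values())
    quotas = {c: core_size * d / total for c, d in diversity.items()}
    # Start from the integer part of each quota, never exceeding cluster size
    alloc = {c: min(int(q), sizes[c]) for c, q in quotas.items()}
    remaining = core_size - sum(alloc.values())
    # Hand out leftover slots by largest fractional remainder
    order = sorted(quotas, key=lambda c: quotas[c] - int(quotas[c]), reverse=True)
    while remaining > 0:
        progressed = False
        for c in order:
            if remaining == 0:
                break
            if alloc[c] < sizes[c]:
                alloc[c] += 1
                remaining -= 1
                progressed = True
        if not progressed:
            break  # core_size exceeds the number of available accessions
    return alloc

# Hypothetical collection of 100 accessions sampled at 20% intensity
diversity = {"A": 0.31, "B": 0.46, "C": 0.25}
sizes = {"A": 40, "B": 30, "C": 30}
alloc = allocate_by_diversity(diversity, sizes, core_size=20)
print(alloc)
```

Note that cluster A is the largest but does not receive the most slots: under this kind of allocation the more diverse cluster B contributes more accessions, which is the contrast with size-proportional allocation drawn in the paragraph above.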

It can be speculated that the D method with the genetic distances MR or CE between genotypes should produce core subsets that tend to represent the genotype diversity present in the entire collection (closer to the breeder's objectives). On the other hand, the D method with the allele diversity indices SH, HE, or NE will tend to produce core subsets that fully represent the allele diversity present in the entire collection (closer to the taxonomist's objectives). The D method used with genetic distances or allelic diversity indices seems to combine well the two main criteria by which accessions are selected for forming a core subset: (i) allele representativeness, obtained by retaining alleles that are widespread in the collection but usually found at low frequencies in each accession, and (ii) allele richness, achieved by conserving very localized (rare) alleles that might be found at high frequency in only a few accessions. This agrees with the reasoning of Marita et al. (2000) regarding conserving alleles by using the breeders' and taxonomists' criteria simultaneously. The objective of the D method is to select the most diverse accessions in terms of genetic distances among genotypes (breeder's perspective for forming core subsets), whereas the M strategy emphasizes selecting accessions with the most diverse alleles (taxonomists'–geneticists' perspective). Nevertheless, some D strategies showed a good recovery of the allelic diversity when compared with the M strategy in the accession data set for HEs and NEs and in the population data set for HEs. The ultimate goal is to form core subsets that simultaneously maximize allele richness and representativeness. Results of this research indicate that the D method provides appropriate criteria for forming core subsets that tend to increase both the genetic distance and the genetic diversity of the accessions in different clusters.
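As a rough illustration of how allele diversity indices of the kind discussed above respond to allele frequencies, the sketch below computes the textbook Shannon index and effective number of alleles for a single locus. The function names and example frequencies are assumptions for illustration, and the study's own definitions of SH and NE (given in Materials and Methods) may be scaled or averaged differently.

```python
import math

def shannon_index(freqs):
    """Shannon diversity for one locus: -sum(p * ln p) over alleles with p > 0."""
    return -sum(p * math.log(p) for p in freqs if p > 0)

def effective_alleles(freqs):
    """Effective number of alleles for one locus: 1 / sum(p^2)."""
    return 1.0 / sum(p * p for p in freqs)

# Hypothetical loci: two equally frequent alleles versus one rare allele
balanced = [0.5, 0.5]
skewed = [0.9, 0.1]

print(round(shannon_index(balanced), 3), round(effective_alleles(balanced), 3))
print(round(shannon_index(skewed), 3), round(effective_alleles(skewed), 3))
```

Both loci carry two alleles, but the locus with a rare allele scores lower on both indices, which is why retaining rare, localized alleles at appreciable frequency in the core matters for these criteria.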

Results of this study agree with Bataillon et al. (1996) and Gouesnard et al. (2001) in the sense that the M strategy is very effective in eliminating redundancy that comes from noninformative alleles because of possible correlations among loci (linkage disequilibrium), which can arise from shared coancestry (population mixtures) and certain assortative mating systems. The M strategy was effective for forming core subsets with high allele richness and a low proportion of noninformative alleles. Franco et al. (2001) determined that large numbers of molecular markers are noninformative for the purpose of classifying genotypes and proposed statistical analyses for discarding redundant markers. The D methods, however, were less effective than the M strategy at excluding noninformative alleles from the core subsets.

Simulated results from Bataillon et al. (1996) show that the M strategy did not outperform other strategies when migration and outcrossing are present in the collection, but it performed well when associations between loci are caused by population substructure due to selfing in the absence of gene flow. This might be a reason why the M strategy outperformed the D method on the diversity indices SHs and NEs in the population data set but not on HEs and NEs in the accession data set. In the accession data set, the HE correlation between the 26 loci was very low (0.05) but higher (0.14) when analyzed as populations. The HE correlations were obtained by computing the HE of each locus in each individual and then calculating the Pearson correlation between the HE values across individuals for every pair of loci. In the accession data set, it is expected that outcrossing between different landraces will be substantial because more than one accession was collected from neighboring geographical regions in Mexico (several accessions from the same states), where the farmers have been found to exchange seeds with their neighbors quite frequently (Pressoir and Berthaud, 2004). On the other hand, it is not expected that outcrossing will occur frequently between landraces that are geographically distant. Although the population data set was derived from the accession data set, their structures are different. The population and bulk data sets are similar in the sense that both contain allele frequencies combined over several individuals (per bulk or population); however, the accession data set had allele frequencies for each individual, disregarding the groups (populations) to which they belong.
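The HE correlation procedure described above can be sketched as follows, under two assumptions that are not stated in the text: that the HE of a locus in an individual is coded as 1 when the two alleles differ and 0 otherwise, and that the single reported value is the average over all locus pairs. The function names and the toy genotypes are hypothetical.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation; returns None when either variable is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return None
    return sxy / sqrt(sxx * syy)

def mean_pairwise_he_correlation(genotypes):
    """genotypes[i][l] is the (allele1, allele2) pair of individual i at locus l."""
    # Assumed coding: 1 if heterozygous at the locus, 0 if homozygous
    he = [[1 if a != b else 0 for (a, b) in ind] for ind in genotypes]
    n_loci = len(he[0])
    cols = [[row[l] for row in he] for l in range(n_loci)]
    rs = [pearson(cols[i], cols[j]) for i, j in combinations(range(n_loci), 2)]
    rs = [r for r in rs if r is not None]   # skip pairs with a monomorphic indicator
    return sum(rs) / len(rs) if rs else None

# Hypothetical genotypes: 4 individuals x 3 loci (allele sizes are placeholders)
genotypes = [
    [(100, 102), (200, 200), (150, 152)],
    [(100, 100), (200, 204), (150, 150)],
    [(102, 104), (200, 200), (152, 152)],
    [(100, 100), (204, 204), (150, 154)],
]
r_mean = mean_pairwise_he_correlation(genotypes)
print(round(r_mean, 4))
```

Values near zero suggest that heterozygosity at one locus says little about heterozygosity at another, which is the pattern expected when outcrossing and seed exchange break up associations between loci.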

The D method was always superior to the M strategy with respect to the breeders' perspective and equal to or worse than the M strategy from the taxonomists' point of view. One advantage of the D method is that it can be used with continuous and categorical variables because it is possible for the Ward-MLM or UPGMA-MLM strategies to form clusters with continuous and discrete variables (Franco et al., 2001). Using continuous variables with the M strategy requires each continuous variable to be broken into a series of discrete classes. Although the criterion of conserving allele diversity for qualitative loci is important, the challenge of preserving quantitative genetic variation in conjunction with marker variation should be considered given the potential for marker-based genetic resources conservation and germplasm enhancement. Further research is required to examine the effectiveness of the D allocation method when phenotypic and genetic marker data are used simultaneously. The presence of redundant markers can be detected, and the simultaneous use of relevant genetic markers and quantitative traits should form better core subsets. Further research is also required to examine the performance of the D method as compared with the M strategy for sampling individuals and forming diverse core subsets for different maize materials and for other crops, as well as for genetic markers other than the SSRs used in this research.

Received for publication July 11, 2005.

