a Facultad de Agronomía, Universidad de la República, Av. Garzón 780 CP 12900, Montevideo, Uruguay
b Biometrics and Statistics Unit, CIMMYT, Apdo. Postal 6-641, 06600, Mexico DF, Mexico
c Maize Genetic Resources Unit, CIMMYT, Mexico
d National Center of Genetic Resources Preservation (NCGRP), USDA, ARS, Fort Collins, CO 80523
* Corresponding author (j.crossa{at}cgiar.org)
| ABSTRACT |
|---|
Abbreviations: DA, days to anthesis; DS, days to silking; EH, ear height; GM, grain moisture; MLM, Modified Location Model; PH, plant height
| INTRODUCTION |
|---|
The process of sampling genetic resources with the objective of forming core subsets starts with grouping accessions into clusters (or groups) that are homogeneous within and heterogeneous between, and then applying a predetermined sampling strategy within each cluster. The grouping of accessions into clusters is achieved by a classification strategy that partitions the original collection into groups with maximum distances between accessions located in different groups and minimum distances between accessions located in the same group. Franco et al. (1998, 1999, 2002) and Franco and Crossa (2002) proposed a sequential Ward-Modified Location Model (MLM) strategy in which the Gower (1971) distance is used as a measure of similarity (or distance) among accessions considering all continuous and categorical variables. The initial groups were formed by the Ward (1963) method, and then the MLM was used to improve those groups. The Ward-MLM strategy was used for analyzing the Latin American Maize Project (Taba et al., 1999) and Caribbean maize collections (Taba et al., 1998) with data from more than 10 countries, with the number of observations per collection ranging from 100 to 1800 and a mixture of continuous and discrete variables. These studies demonstrated that the Ward-MLM formed compact and well-separated clusters.
The reason for sampling accessions when forming core subsets is to identify a strategy that will structure a sample that recovers most of the diversity contained in the original collection, while maximizing the variance and the distances between accessions in the sample. A sampling strategy involves defining a sampling intensity, a sampling method, and an allocation method (Thompson, 2002).
The sampling intensity defines the overall sample size, and for core collections, several authors studied sampling intensities that ranged from 5 to 20% of the total number of accessions (Brown, 1989; Schoen and Brown, 1993; Brown and Spillane, 1999; van Hintum, 1999; van Hintum et al., 2000). For species such as perennial ryegrass (Lolium perenne L.), Charmet and Balfourier (1995) found that a sampling intensity of 5 to 10% is optimal for capturing 86 to 90% of the diversity. However, for forming core collections of Medicago species, Diwan et al. (1995) pointed out that sampling intensities of 5 to 10% are insufficient to represent the original collection.
A stratified sampling method partitions the collection into clusters or groups, and then accessions within each cluster are selected. Several authors have recommended stratified sampling strategies for managing genetic resources and forming core subsets (Peeters and Martinelli, 1989; Crossa et al., 1994, 1995a; Spagnoletti Zeuli and Qualset, 1993; Charmet and Balfourier, 1995; Rincon et al., 1996). Statistical methods for stratifying genetic resources using three-way data (accessions x trait x location), with the purpose of forming core subsets, have been discussed by Crossa et al. (1995b) and, more recently, by Franco et al. (2003).
An allocation method provides criteria for determining the number of accessions to be selected from each cluster. For core subsets, Brown (1989) described three allocation methods whose sample sizes are (i) constant (or fixed) across clusters, (ii) proportional to the cluster size, and (iii) proportional to the logarithm of the cluster size. Brown (1989) also compared simple versus stratified sampling methods and recommended a stratified logarithmic method for choosing accessions from the collection. Finally, Brown (1989) proposed the logarithm of the cluster size (L method) as the allocation method. Yonezawa et al. (1995), Chandra et al. (2002), Diwan et al. (1995), and Zichao et al. (2002) have used the L method for sampling various crops. Diwan et al. (1994) formed core collections of 36 annual Medicago species and used an allocation method based on the diversity for the variables measured. The number of clusters formed in each species determined the diversity within species.
The main objectives of this study were to propose an allocation method (D method) for selecting accessions from the clusters (obtained by the Ward-MLM two stage strategy) and to compare it with other allocation methods (L, LD, and NY methods) with the aim of determining which one forms core subsets that best retain the diversity contained in the original collection. The four allocation methods determine sample size on the basis of different characteristics: (i) the D method: sample size proportional to the mean Gower distances between accessions within the cluster, (ii) the L method [proposed by Brown (1989)]: sample size proportional to the logarithm of the cluster size, (iii) the NY method [a modification of Neyman's (1934) method]: sample size proportional to the product of the cluster size times the mean Gower distance, and (iv) the LD method [a modification of Neyman's (1934) method]: sample size proportional to the product of the logarithm of the cluster size times the mean Gower distance. Five hundred independent stratified random samples under two sampling intensities, 10 and 20%, were obtained from three maize (Zea mays L.) collections and one maize population to compare the ability of the four allocation methods to retain the diversity of the collections.
| MATERIALS AND METHODS |
|---|
For k variables (k = 1,2,...,p), Gower's similarity measurement between two individuals i and j is:
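$$S_{ij} = \frac{\sum_{k=1}^{p} w_{ijk}\, s_{ijk}}{\sum_{k=1}^{p} w_{ijk}}$$
where $s_{ijk}$ is the contribution of the kth variable to the similarity between i and j, and $w_{ijk}$ is a weight equal to 1 when the comparison on the kth variable is valid and 0 otherwise.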
The Gower distance can be used as a diversity measure for a set of individuals (genotypes, accessions, etc.), with the important advantage that all types of variables can be used. Two genotypes with distances near zero show low diversity, whereas values near 1 indicate very diverse individuals.
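As a rough illustration of how such a distance could be computed for mixed data, the Python sketch below averages per-variable similarities: range-scaled absolute differences for continuous variables and simple matching for nominal or binary ones. The function names, the equal weights, the simple-matching treatment of binary variables, and the conversion d = 1 - S are assumptions made for this example; they are not taken from the authors' SAS code.

```python
def gower_distance(a, b, kinds, ranges):
    """Gower distance between two accessions, assuming d = 1 - S.

    a, b   : sequences of trait values (None marks a missing value)
    kinds  : "c" for continuous, "n" for nominal or binary, per variable
    ranges : range of each continuous variable in the collection
             (the entry is ignored for nominal variables)
    """
    total_sim = 0.0
    n_valid = 0
    for x, y, kind, rng in zip(a, b, kinds, ranges):
        if x is None or y is None:
            continue  # comparison not valid: weight 0
        if kind == "c":
            # Range-scaled similarity; a zero range means no variation.
            sim = 1.0 - abs(x - y) / rng if rng > 0 else 1.0
        else:
            sim = 1.0 if x == y else 0.0
        total_sim += sim
        n_valid += 1
    if n_valid == 0:
        raise ValueError("no valid comparisons between the two accessions")
    return 1.0 - total_sim / n_valid


def mean_gower_distance(rows, kinds, ranges):
    """Mean Gower distance over all pairs of accessions in `rows`."""
    pairs = [(i, j) for i in range(len(rows)) for j in range(i + 1, len(rows))]
    if not pairs:
        return 0.0
    total = sum(gower_distance(rows[i], rows[j], kinds, ranges) for i, j in pairs)
    return total / len(pairs)
```

The second function gives the kind of mean pairwise distance that serves as a diversity measure for a cluster, a sample, or a whole collection.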
The D Allocation Method
The D allocation method proposed in this study determines that the size of the sample to be drawn from each cluster should be proportional to the mean Gower distance between individuals within that cluster. Therefore, the number of accessions selected from each cluster will be proportional to the within-group diversity measured as the mean Gower distance between accessions within that group. More diverse groups will have a larger mean Gower's distance and therefore larger samples will have to be drawn from them.
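For t = 1, 2, ..., g clusters, the number of accessions ($n_t$) to be drawn from the tth cluster is
$$n_t = n \, \frac{\bar{D}_t}{\sum_{t=1}^{g} \bar{D}_t}$$ | [1] |
where $n$ is the total sample size and $\bar{D}_t$ is the mean Gower distance between accessions within the tth cluster.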
The L Allocation Method
The L allocation method uses the logarithm of the size of the tth cluster ($N_t$) to obtain the sample size of the tth cluster ($n_t$):
$$n_t = n \, \frac{\log(N_t)}{\sum_{t=1}^{g} \log(N_t)}$$ | [2] |
The NY and LD Allocation Methods
Neyman (1934) proposed an optimal allocation method for estimating, with minimum variance, the mean value of the variables in each cluster via stratified samples. The method determines that the size of the sample to be drawn from each cluster is proportional to the cluster size ($N_t$) and the standard deviation of the variable of interest, $S_t$, such that $n_t = n \times N_t S_t / \sum_{t=1}^{g} N_t S_t$. It recovers as much of the diversity present in the collection as possible by using the standard deviation of the variables in the cluster as the diversity measurement.
To make the Neyman (1934) optimal allocation method comparable with the other allocation methods, it was modified in two ways. First, the size of the tth cluster ($N_t$) was weighted by the diversity measured as the mean Gower distance ($\bar{D}_t$). This allocation method was named the NY method and is represented by
$$n_t = n \, \frac{N_t \bar{D}_t}{\sum_{t=1}^{g} N_t \bar{D}_t}$$ | [3] |
Second, to smooth out the effect of cluster size, the logarithm of $N_t$ was weighted by the diversity of the tth cluster measured as the mean Gower distance ($\bar{D}_t$). This method was named the LD method:
$$n_t = n \, \frac{\log(N_t)\, \bar{D}_t}{\sum_{t=1}^{g} \log(N_t)\, \bar{D}_t}$$ | [4] |
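The four allocation rules differ only in the weight assigned to each cluster, which the following Python sketch makes explicit. The function name `allocate`, the rounding to the nearest integer, and the cap at the cluster size are assumptions of this example; the text does not state how fractional allocations were rounded, so the allocated sizes need not sum exactly to n.

```python
import math


def allocate(n, sizes, mean_dists, method):
    """Split a total sample of n accessions among clusters.

    sizes      : cluster sizes N_t
    mean_dists : mean within-cluster Gower distances
    method     : "D", "L", "NY", or "LD"
    """
    weight = {
        "D": lambda N, d: d,                 # diversity only
        "L": lambda N, d: math.log(N),       # log of cluster size
        "NY": lambda N, d: N * d,            # size times diversity
        "LD": lambda N, d: math.log(N) * d,  # log size times diversity
    }[method]
    w = [weight(N, d) for N, d in zip(sizes, mean_dists)]
    total = sum(w)
    # Assumption: round to the nearest integer and never ask a cluster
    # for more accessions than it contains.
    return [min(N, round(n * wt / total)) for N, wt in zip(sizes, w)]
```

For a hypothetical collection with cluster sizes 10, 100, and 1000 and mean distances 0.2, 0.3, and 0.5, calling `allocate(100, [10, 100, 1000], [0.2, 0.3, 0.5], "NY")` concentrates almost the whole sample in the largest cluster, whereas the D weights ignore cluster size entirely.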
The Ward-MLM Sequential Clustering Strategy
The initial groups formed by any hierarchical (geometric) clustering technique are based on the principle that governs that technique; for example, minimum variance within groups when the initial technique is Ward's method. Geometric clustering methods can be used with continuous and/or discrete variables by means of Gower's distance.
Statistical classification methods use the concept of mixture models. An initial classification of the individuals into g clusters is given so that each group is one of the distributions in the mixture. The vector with the mean of the traits and the variance-covariance matrix within clusters are estimated by the maximum-likelihood method. The maximization of the likelihood function begins at a point that has been reached using the geometric technique; it will then reach a peak (which could be local) near the starting point that contains the characteristics of the geometric technique.
The Modified Location Model is a mixture model developed by Franco et al. (1998) that uses continuous and discrete variables simultaneously. The Ward-MLM sequential clustering strategy forms the initial groups using the Ward method and then improves them by the MLM, the idea being that the MLM method will modify the groups initially formed by the Ward method, so that the final classification is a statistical one.
The Ward strategy is the recommended geometric clustering method to use in the two-stage clustering strategy because (i) the objective function of the Ward strategy is to minimize the variance within clusters, whereas the objective function of the mixture distribution model is to maximize the likelihood, of which the variance within a cluster is a component, (ii) the direct relationship between the Ward strategy and the multivariate analysis of variance technique is based on the result that the total variance is equal to the variance between clusters plus the variance within clusters, and (iii) the objective function of the Ward strategy tends to produce spherical clusters, whereas the mixture distribution model allows the formation of clusters of another shape. Thus, the sequential clustering strategy allows the MLM to modify the form of the initial groups obtained by the Ward strategy to one that permits the formation of more homogeneous groups.
Determining the Number of Clusters in the Ward-MLM Method
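The number of groups was determined by, first, the pseudo-F criterion (SAS Institute, 2000), in which, for each division into g groups, the following ratio is computed:
$$\text{pseudo-}F = \frac{B/(g-1)}{W/(N-g)}$$
where $B$ and $W$ are the between-cluster and within-cluster sums of squares and $N$ is the number of accessions.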
Then, we used the graph of the likelihood profile (related to the likelihood ratio test) for different values of g near the value obtained by the pseudo-F, and observed the maximum growth point of the likelihood profile as a criterion for determining the definitive number of groups. The optimal number of groups was then determined using the pseudo-F approach combined with the log-likelihood profile.
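A small Python sketch can make the pseudo-F criterion concrete. It assumes the usual definition of the statistic, the between-cluster sum of squares divided by g - 1 over the within-cluster sum of squares divided by N - g, and it handles continuous variables only; the function name `pseudo_f` is hypothetical and the study itself used SAS.

```python
def pseudo_f(points, labels):
    """Pseudo-F = [B / (g - 1)] / [W / (N - g)] for one partition.

    points : list of equal-length tuples of continuous values
    labels : cluster label of each point
    """
    n_obs = len(points)
    groups = {}
    for p, lab in zip(points, labels):
        groups.setdefault(lab, []).append(p)
    g = len(groups)
    if g < 2 or n_obs <= g:
        raise ValueError("need at least 2 clusters and more points than clusters")

    def centroid(rows):
        return [sum(col) / len(rows) for col in zip(*rows)]

    def sq_dist(p, q):
        return sum((x - y) ** 2 for x, y in zip(p, q))

    grand = centroid(points)
    within = 0.0
    between = 0.0
    for rows in groups.values():
        c = centroid(rows)
        within += sum(sq_dist(p, c) for p in rows)
        between += len(rows) * sq_dist(c, grand)
    # Raises ZeroDivisionError if every cluster is a single repeated point.
    return (between / (g - 1)) / (within / (n_obs - g))
```

Computing this value for several candidate numbers of groups and looking for a peak gives a starting point that the likelihood profile can then refine.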
Datasets
In this study, three collections having different sizes (N), different values of diversity, and different numbers of clusters (g) were used (Taba et al., 1999). The Guatemalan collection had N = 100 accessions and the Ward-MLM strategy formed g = 5 clusters. The Brazilian collection comprised N = 652 accessions and the Ward-MLM strategy formed g = 13 clusters. The collection from Mexico had N = 1460 accessions and g = 17 clusters were formed (Table 1). These datasets contained five continuous variables (days to anthesis, days to silking, plant and ear height, and grain moisture), two nominal variables (kernel color and texture), and two binary variables [number of ears per plant, coded 0 when it was less than or equal to 1 and 1 when it was more than 1; ear quality rating (1–9), coded 0 when it was less than or equal to 4.5 and 1 when it was more than 4.5].
Independent Stratified Random Samples
The allocation methods define how many, but not which specific, accessions per cluster should be sampled. The proposed D allocation method was evaluated and compared with the L, LD, and NY allocation methods by randomly drawing 500 samples from three maize collections and one maize gene pool. First, accessions from each of the four datasets were classified by Ward-MLM. Second, from each classified dataset, 500 independent stratified random samples (without replacement) were drawn for each of the factorial combinations of two sampling intensities (10 and 20% of the entire collection) and the four allocation methods (D, L, LD, and NY). This was done with the SURVEYSELECT procedure of SAS (SAS Institute, 2000) and code written in the SAS IML procedure (SAS Institute, 2000). For each of the 500 samples, accessions within each cluster in each of the four datasets were selected at random. Values were then computed for the criteria used to compare the four allocation methods (see below).
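The repeated stratified draws can be sketched in a few lines of Python using only the standard library. This is an illustration of the sampling scheme, not a translation of the SURVEYSELECT and IML code used in the study; the function names, the dictionary layout of the clusters, and the fixed seed are assumptions of the example.

```python
import random


def stratified_sample(clusters, sizes, rng):
    """Draw sizes[t] accessions without replacement from each cluster t.

    clusters : dict mapping cluster name to a list of accession ids
    sizes    : number of accessions to draw from each cluster, in dict order
    rng      : a random.Random instance
    """
    return {
        name: rng.sample(members, min(k, len(members)))
        for (name, members), k in zip(clusters.items(), sizes)
    }


def repeated_samples(clusters, sizes, n_rep=500, seed=0):
    """Draw n_rep independent stratified random samples."""
    rng = random.Random(seed)
    return [stratified_sample(clusters, sizes, rng) for _ in range(n_rep)]
```

Each returned sample can then be scored with the comparison criteria, giving one value per criterion per replicate.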
Criteria for Comparing the Allocation Methods
A sampling strategy aims (i) to define a sampling intensity and an allocation method that will retain in the sample most of the collection diversity and (ii) to produce a sample with maximum variance and maximum distance between accessions, as compared with the variance and distances between accessions in the entire collection. The criteria we used for comparing the D method with the L, LD, and NY methods are described as follows.
Diversity of the Sample
The best allocation method is the one that produces a sample with a greater mean Gower distance among accessions ($\bar{D}_s$). For allocation methods, sampling intensities, and allocation method × sampling intensity interactions, the mean Gower distances across 500 independent random samples were statistically compared.
Recovery of the Range in the Sample
The recovery of the range (RR) for all variables (discrete and continuous) is given by $RR = \frac{1}{p} \sum_{k=1}^{p} \frac{R_{nk}}{R_{Nk}}$, where $R_{nk}$ and $R_{Nk}$ are the ranges of the kth variable in the sample and in the entire collection, respectively, for k = 1, 2, ..., p variables. An allocation method is better if it selects a sample with an RR near 1. The mean recovery of the range ($\overline{RR}$) values for allocation methods, sampling intensities, and allocation method × sampling intensity interactions were also statistically compared.
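A short Python sketch shows how the recovery of the range could be computed for numerically coded variables, taking RR as the average over variables of the sample range divided by the collection range. The function name and the decision to skip variables that are constant in the collection are assumptions of this example.

```python
def recovery_of_range(sample, collection):
    """Mean over variables of (range in sample) / (range in collection).

    sample, collection : lists of equal-length tuples of numeric values
    """
    ratios = []
    for s_col, c_col in zip(zip(*sample), zip(*collection)):
        c_range = max(c_col) - min(c_col)
        if c_range == 0:
            continue  # assumption: a constant variable has no range to recover
        ratios.append((max(s_col) - min(s_col)) / c_range)
    if not ratios:
        raise ValueError("no variable with a nonzero range")
    return sum(ratios) / len(ratios)
```

A sample that contains the extreme accessions for every variable returns exactly 1, which is why values near 1 indicate a good allocation.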
Variances of the Samples
An optimal allocation method should produce samples with high variance among the accessions. The variance of the accessions in the sample was measured for the five continuous variables: days to anthesis (DA), days to silking (DS), plant height (PH), ear height (EH), and grain moisture (GM). Thus, differences in the mean variances of each continuous variable, $\bar{S}^2_{DA}$, $\bar{S}^2_{DS}$, $\bar{S}^2_{PH}$, $\bar{S}^2_{EH}$, and $\bar{S}^2_{GM}$, for allocation methods, sampling intensities, and allocation method × sampling intensity interactions were statistically assessed.
Comparing Allocation Methods
Analyses of variance for each dataset considered the allocation method, the sampling intensity, and the allocation method × sampling intensity interaction as fixed effects. Comparisons between allocation methods were performed across sampling intensities and within sampling intensity. The dependent variables were the criteria used to evaluate the allocation methods: diversity of the sample measured by the mean Gower distance among accessions in the sample ($\bar{D}_s$), the recovery of the range in the sample ($\overline{RR}$), and the variance of the sample for the five continuous variables DA, DS, PH, EH, and GM ($\bar{S}^2_{DA}$, $\bar{S}^2_{DS}$, $\bar{S}^2_{PH}$, $\bar{S}^2_{EH}$, and $\bar{S}^2_{GM}$, respectively).
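Pairwise comparisons of allocation methods across sampling intensities and within sampling intensity were made for $\bar{D}_s$, $\overline{RR}$, $\bar{S}^2_{DA}$, $\bar{S}^2_{DS}$, $\bar{S}^2_{PH}$, $\bar{S}^2_{EH}$, and $\bar{S}^2_{GM}$ using Tukey's studentized range test.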
Ranking the Allocation Methods
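The Friedman two-way test (Conover, 1971) was performed, within each sampling intensity, for testing the null hypothesis that each ranking of the allocation methods for $\bar{D}_s$, $\overline{RR}$, $\bar{S}^2_{DA}$, $\bar{S}^2_{DS}$, $\bar{S}^2_{PH}$, $\bar{S}^2_{EH}$, and $\bar{S}^2_{GM}$ is equally likely (i.e., there is not a consistent order among allocation methods) versus the alternative hypothesis that a consistent order exists.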
Comparing Allocation Methods with the Entire Collection
On the basis of the criteria described above, we compared the four allocation methods with the entire collection in each of the 500 independent stratified random samples.
It is expected that the mean Gower distance between accessions in the sample is greater than that between accessions in the entire collection, because the sample preserves diversity while containing fewer redundant accessions. Thus, if the sample has a good representation of the diversity in the collection but fewer redundant accessions, its mean Gower distance will be greater than the mean Gower distance in the entire collection. If the mean Gower distance between accessions of the entire collection is $\bar{D}_c$, then a good performance criterion is when the mean Gower distance between the selected accessions forming the sample ($\bar{D}_s$) is greater than $\bar{D}_c + 0.1\bar{D}_c$, $\bar{D}_c + 0.2\bar{D}_c$, or $\bar{D}_c + 0.3\bar{D}_c$.
Concerning the recovery of the range (RR) of the variables in the sample, an allocation method is better if it selects a sample with high RR. Regarding the variances of the variables in the sample, a procedure is better if it produces samples with higher variances than the variance among accessions in the entire collection. We used three criteria of the form $S^2_S > S^2_C + aS^2_C$, with margins $a$ between 0.1 and 0.5, where $S^2_S$ and $S^2_C$ are the variances for the sample and the entire collection, respectively, for each continuous variable. In the sampling study, the number of times that each criterion was met was recorded.
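Counting how often a sample statistic exceeds the collection value by a given margin is a one-line tally, sketched below in Python. The function name `share_exceeding` and the example numbers in the usage note are hypothetical; the same function applies to the mean Gower distance and to the variance of any continuous variable.

```python
def share_exceeding(sample_values, collection_value, margin):
    """Fraction of samples whose statistic exceeds (1 + margin) * collection value.

    sample_values    : one value of the statistic per random sample
    collection_value : the same statistic computed on the entire collection
    margin           : for example 0.1 for "collection value plus 10%"
    """
    threshold = collection_value * (1.0 + margin)
    hits = sum(1 for v in sample_values if v > threshold)
    return hits / len(sample_values)
```

With made-up sample distances of 0.45, 0.50, 0.55, and 0.60 and a collection distance of 0.44, three of the four samples exceed the 10% margin and one exceeds the 30% margin.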
| RESULTS AND DISCUSSION |
|---|
The Ward-MLM strategy gave a smaller mean Gower distance ($\bar{D}_t$) between accessions within each cluster than the average of the Gower distances between accessions in the entire collection ($\bar{D}$) for the four datasets (Table 1). The dataset from Mexico showed the highest number of observations, the highest number of groups, and the lowest values for within-cluster ($\bar{D}_t$ = 0.33) and total average ($\bar{D}$ = 0.44) distances. The Guatemala dataset had the lowest number of observations and lowest number of groups, whereas the Brazil and Pool 25 datasets had the highest values for $\bar{D}$ and $\bar{D}_t$, respectively. The values of $\bar{D}_t$ for each individual cluster in all datasets were always smaller than the average distance between accessions in the entire collection ($\bar{D}$), except for two clusters (3 and 5) in the Mexico collection (Table 2). These results indicate that the Ward-MLM sequential clustering strategy formed homogeneous groups. When the allocation method requires a sample size larger than the size of the cluster, fewer accessions are sampled. This is the case in the Mexico collection, where the D method resulted in selecting fewer accessions from cluster 5 (17) than from clusters 2, 3, 9, 10, 15, and 17, even though cluster 5 had the greatest $\bar{D}_t$.
In general, the NY method tends to allocate samples of very different sizes to the groups. For example, in the Mexico collection the sample size per group ranged from 3 to 73. In contrast, the D and LD methods allocated samples that varied less in size: with the D method, the sample size per group ranged from 13 to 24, and with the LD method, from 13 to 25.
The size of samples drawn from each cluster using the D allocation method is based on the diversity of the cluster ($\bar{D}_t$) and not on its size ($N_t$) (Table 2). For example, for the Mexico collection, Group 6 had $N_t$ = 450 accessions with the lowest diversity, $\bar{D}_t$ = 0.25; the D method allocated 13 accessions to this group, whereas the LD, NY, and L methods allocated 21, 73, and 28 accessions, respectively. On the other hand, Mexico Groups 3 and 5 had $N_t$ = 29 and $N_t$ = 17 accessions, respectively, and the two highest diversity values, $\bar{D}_t$ = 0.47 and $\bar{D}_t$ = 0.48, respectively; the D method allocated 24 and 17 accessions to Groups 3 and 5, respectively, but the other allocation methods assigned a smaller number of accessions to these clusters. Similarly, for the Brazil collection, Group 9 had $N_t$ = 106 and $\bar{D}_t$ = 0.24 and Group 13 comprised $N_t$ = 50 and had $\bar{D}_t$ = 0.48; the D method assigned 6 accessions to Group 9 and 12 to Group 13.
Comparing Allocation Methods
Diversity of the Sample
The mean Gower distances between accessions across the 500 samples ($\bar{D}_s$) were higher than the respective mean Gower distance between accessions in the entire collection for the four datasets and for each allocation method × sampling intensity combination (Table 3). The minimum value of the 500 samples for all datasets and allocation methods was always larger than the mean Gower distance between accessions of the corresponding datasets. These results indicate that all allocation methods selected samples formed by a well-differentiated group of accessions.
The analyses of variance for $\bar{D}_s$ showed significant differences (P ≤ 0.01) between levels of allocation method, sampling intensity, allocation method × sampling intensity interaction, and allocation methods within sampling intensities (data not shown). For all datasets and both sampling intensities, Tukey's test indicated that $\bar{D}_s$ of the D method was always significantly higher (P ≤ 0.01) than $\bar{D}_s$ of the other allocation methods (Table 3). When combining the allocation methods across both sampling intensities, $\bar{D}_s$ of the D method was significantly superior to $\bar{D}_s$ of the other allocation methods for all datasets (data not shown). For all datasets, the $\bar{D}_s$ of the D method produced with a sampling intensity of 10% was significantly higher than the $\bar{D}_s$ of samples generated with a 20% sampling intensity (data not shown). The distribution of the mean Gower distances (mean D) from 500 samples is shown as box plots in Fig. 1. The D method produced the highest values for all datasets and for both sampling intensities (10% and 20%). In general, a 10% sampling intensity generated samples with a higher mean Gower distance than the 20% sampling intensity, for all allocation methods and collections. Thus, for these datasets and this diversity criterion, a 20% sampling intensity resulted in redundant information, and the 10% sampling intensity was sufficient for representing the collection's diversity.
The analyses of variance for $\overline{RR}$ showed significant differences (P ≤ 0.01) between levels of allocation method, sampling intensity, allocation method × sampling intensity interaction, and allocation methods within sampling intensities in all datasets (data not shown). Tukey's test indicated that $\overline{RR}$ of the D method was always significantly higher (P ≤ 0.01) than $\overline{RR}$ of the other allocation methods (Table 3) in all datasets except Brazil, in both sampling intensities. Averaged across sampling intensities, the D method had $\overline{RR}$ values significantly larger than the $\overline{RR}$ values of the other allocation methods for all datasets except Brazil ($\overline{RR}$ of the D and L methods were similar). In all datasets, the $\overline{RR}$ for the 20% sampling intensity (across allocation methods) was significantly larger than the $\overline{RR}$ for the 10% sampling intensity (data not shown). The distribution of the RR values from 500 samples is shown as box plots in Fig. 2. In general, a 20% sampling intensity generated samples with better RR values than the 10% sampling intensity, for all allocation methods and collections (Fig. 2).
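The analyses of variance showed significant effects (P ≤ 0.01) for all datasets for the mean variances of the five variables. Tukey's test indicated that the values of $\bar{S}^2_{DA}$, $\bar{S}^2_{DS}$, $\bar{S}^2_{PH}$, $\bar{S}^2_{EH}$, and $\bar{S}^2_{GM}$ were significantly larger with the D method than with the other methods in most cases. The exceptions were (1) one of the variances in Guatemala, Brazil, and Pool 25 for both sampling intensities; (2) three of the variances in Brazil for the 20% sampling intensity; and (3) two of the variances in Pool 25 for the 10% and 20% sampling intensities (Table 3).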
The mean variances of the variables for all datasets and allocation methods tended to be larger for the 10% sampling intensity than for the 20% sampling intensity (Table 3). When the allocation methods are averaged across sampling intensities, the mean variances of two of the continuous variables were significantly larger for the D method than for the other allocation methods (data not shown). For two other variables, the D method significantly differed from the other methods, except in Pool 25. For the remaining variable, the D method differed from the others only in Mexico and Pool 25.
Ranking the Allocation Methods
The D allocation method ranked consistently first for $\bar{D}_s$ and $\overline{RR}$ for all datasets and sampling intensities. The D method ranked first for most of the variances of the five continuous variables, except for one variance in Guatemala under both sampling intensities and for one variance in Brazil and Pool 25 under the 20% sampling intensity. The mean rank of each allocation method in each dataset and sampling intensity is shown in Table 3. The Friedman test for each dataset and sampling intensity determined that the data are consistent with the hypothesis that the D allocation method ranked consistently higher than the other allocation methods for all seven variables.
Comparing Allocation Methods with the Entire Collection
Diversity of the Sample
Across datasets and sampling intensities, the D allocation method produced a larger percentage of samples meeting the $\bar{D}_s$ criterion than the other allocation methods at both sampling intensities (Table 4). For one of the intervals, the D method was superior to the other methods only in Mexico (at both sampling intensities) and in Guatemala with the 10% sampling intensity.
Variances of the Samples
For all datasets and sampling intensities, the D method resulted in the highest percentage of the 500 samples with variances among the accessions in the sample ($S^2_S$) that were greater than the three thresholds based on $S^2_C$ (data not shown). The only exception was for the variable GM in the Guatemala collection. It is interesting that for all datasets, the D method tended to generate more diverse samples than the other methods as the width of the interval increased from 10% to 50%. These results indicate that the D method produced samples with maximum variance and maximum distance between accessions as compared with the variance and the distances between accessions in the entire collection.
| CONCLUSIONS |
|---|
In most cases, for the response variables $\bar{D}_s$, $\overline{RR}$, and the mean variances of the five continuous variables, the D allocation method ranked first. The mean rank of the D allocation method was statistically higher than the mean rank of the other allocation methods. Concerning the sampling intensities, the results of this study indicated that, for $\bar{D}_s$ and the five mean variances, a sample of 10% of the entire collection is sufficient for preserving the diversity of the collection, whereas results based on $\overline{RR}$ showed that a sampling intensity of 20% preserves more of the diversity.
In this study, accessions from each cluster were randomly selected according to the sample size determined by the four allocation methods. However, allocation methods do not define which specific accessions should be sampled. Accessions can be selected from each cluster on the basis of other criteria as well, such as general agronomic performance, grain yield, and general plant type. Some researchers may decide to select the best performing accessions to be crossed with line testers or elite germplasm sources, and then initiate a prebreeding program. For example, the D method can be combined with an agronomic selection criterion for selecting accessions from each cluster.
The D method can be used with any clustering strategy and any distance measure. In this study the clustering strategy was the Ward-MLM used with continuous and discrete variables; the only distance that can be used for such data is Gower's distance, which is thus the distance that should be used in the D allocation method. The D method may be useful not only for sampling genetic diversity in crop germplasm collections but also in other areas of research where a stratified sampling method is required for preserving as much of the original population's diversity as possible.
The Ward-MLM strategy can use phenotypic and genetic marker data simultaneously, as shown by Franco et al. (2001). Using only molecular markers and/or DNA sequence data, various genetic distances and hierarchical clustering algorithms can be employed, and various allocation methods evaluated. Results can be validated based on phenotypic data, as was done by McKhann et al. (2004). However, further research is needed to assess the usefulness of the D allocation method using only marker data and to compare it with other allocation methods that do not use stratified sampling such as the M (maximization) strategy proposed by Schoen and Brown (1993).
Received for publication May 12, 2004.