Crop Science Journal of Natural Resources and Life Sciences Education
Published online 6 May 2005
Published in Crop Sci 45:1035-1044 (2005)
© 2005 Crop Science Society of America
677 S. Segoe Rd., Madison, WI 53711 USA

PLANT GENETIC RESOURCES

A Sampling Strategy for Conserving Genetic Diversity when Forming Core Subsets

Jorge Franco (a), José Crossa (b,*), Suketoshi Taba (c), and Henry Shands (d)

a Facultad de Agronomía, Universidad de la República, Av. Garzón 780 CP 12900, Montevideo, Uruguay
b Biometrics and Statistics Unit, CIMMYT, Apdo. Postal 6-641, 06600, Mexico DF, Mexico
c Maize Genetic Resources Unit, CIMMYT, Mexico
d National Center of Genetic Resources Preservation (NCGRP), USDA, ARS, Fort Collins, CO 80523

* Corresponding author (j.crossa{at}cgiar.org)


    ABSTRACT
When forming core subsets, accessions from a collection are classified into clusters, and samples are then drawn from the clusters with the aim of maintaining the diversity of the collection. In a stratified sampling strategy, the allocation method provides a criterion for determining the number of accessions to be selected from each cluster. This paper proposes an allocation method (D method) and compares it with three other allocation methods (L, NY, and LD methods). In these allocation methods, the number of accessions sampled per cluster is proportional to (i) the mean Gower distance between accessions within the cluster (D method), (ii) the logarithm of the cluster size (L method), (iii) the product of the cluster size and the mean Gower distance (NY method), and (iv) the product of the logarithm of the cluster size and the mean Gower distance (LD method). Five hundred independent stratified random samples were obtained from each of four datasets at each of two sampling intensities (10 and 20%). The allocation methods were compared on the basis of three criteria: diversity of the samples, recovery of the range of the variables in the sample, and variances of the samples. Results showed that the D method produced samples (i) with significantly more diversity than the other allocation methods, (ii) that recovered more of the range of the variables, (iii) with higher variances for the continuous variables than the other three methods, and (iv) with variances higher than the variance among accessions of the collection. A sampling intensity of 10% preserves the same or more variability than a sampling intensity of 20%.

Abbreviations: DA, days to anthesis • DS, days to silking • EH, ear height • GM, grain moisture • MLM, Modified Location Model • PH, plant height


    INTRODUCTION
GENETIC RESOURCES stored in gene banks are usually sampled to foster efficient evaluation and utilization of the collections as well as to study phenotypic and genotypic diversity, form core subsets, and eliminate redundant and duplicate accessions within a collection. The main purpose of these activities is to preserve in the sample as much of the diversity present in the original collection as possible (Crossa et al., 1995a). For example, the approach of forming core collections (core subsets) was introduced to increase the efficiency of describing and using collections stored in gene banks, while preserving as much as possible the diversity of the entire collection (Frankel and Brown, 1984; Brown, 1989).

The process of sampling genetic resources to form core subsets starts with grouping accessions into clusters (or groups) that are internally homogeneous and mutually heterogeneous, and then applying a predetermined sampling strategy within each cluster. The grouping is achieved by a classification strategy that partitions the original collection so that distances between accessions located in different groups are maximized and distances between accessions located in the same group are minimized. Franco et al. (1998, 1999, 2002) and Franco and Crossa (2002) proposed a sequential Ward-Modified Location Model (MLM) strategy in which the Gower (1971) distance is used as a measure of similarity (or distance) among accessions, considering all continuous and categorical variables. The initial groups are formed by the Ward (1963) method, and the MLM is then used to improve those groups. The Ward-MLM strategy was used for analyzing the Latin American Maize Project (Taba et al., 1999) and Caribbean maize collections (Taba et al., 1998), with data from more than 10 countries, the number of observations per collection ranging from 100 to 1800, and a mixture of continuous and discrete variables. These studies demonstrated that the Ward-MLM strategy formed compact and well-separated clusters.

The aim when sampling accessions to form core subsets is to identify a strategy that yields a sample that recovers most of the diversity contained in the original collection, while maximizing the variance and the distances between accessions in the sample. A sampling strategy involves defining a sampling intensity, a sampling method, and an allocation method (Thompson, 2002).

The sampling intensity defines the overall sample size. For core collections, several authors have studied sampling intensities that ranged from 5 to 20% of the total number of accessions (Brown, 1989; Schoen and Brown, 1993; Brown and Spillane, 1999; van Hintum, 1999; van Hintum et al., 2000). For species such as perennial ryegrass (Lolium perenne L.), Charmet and Balfourier (1995) found that a sampling intensity of 5 to 10% is optimal for capturing 86 to 90% of the diversity. However, for forming core collections of Medicago species, Diwan et al. (1995) pointed out that sampling intensities of 5 to 10% are insufficient to represent the original collection.

A stratified sampling method partitions the collection into clusters or groups, and then accessions within each cluster are selected. Several authors have recommended stratified sampling strategies for managing genetic resources and forming core subsets (Peeters and Martinelli, 1989; Crossa et al., 1994, 1995a; Spagnoletti Zeuli and Qualset, 1993; Charmet and Balfourier, 1995; Rincon et al., 1996). Statistical methods for stratifying genetic resources using three-way data (accession × trait × location), with the purpose of forming core subsets, have been discussed by Crossa et al. (1995b) and, more recently, by Franco et al. (2003).

An allocation method provides criteria for determining the number of accessions to be selected from each cluster. For core subsets, Brown (1989) described three allocation methods whose sample sizes are (i) constant (or fixed) across clusters, (ii) proportional to the cluster size, and (iii) proportional to the logarithm of the cluster size. Brown (1989) also compared simple versus stratified sampling methods and recommended a stratified logarithmic method for choosing accessions from the collection. Finally, Brown (1989) proposed the logarithm of the cluster size (L method) as the allocation method. Yonezawa et al. (1995), Chandra et al. (2002), Diwan et al. (1995), and Zichao et al. (2002) have used the L method for sampling various crops. Diwan et al. (1994) formed core collections of 36 annual Medicago species and used an allocation method based on the diversity for the variables measured. The number of clusters formed in each species determined the diversity within species.

The main objectives of this study were to propose an allocation method (D method) for selecting accessions from the clusters (obtained by the Ward-MLM two stage strategy) and to compare it with other allocation methods (L, LD, and NY methods) with the aim of determining which one forms core subsets that best retain the diversity contained in the original collection. The four allocation methods determine sample size on the basis of different characteristics: (i) the D method: sample size proportional to the mean Gower distances between accessions within the cluster, (ii) the L method [proposed by Brown (1989)]: sample size proportional to the logarithm of the cluster size, (iii) the NY method [a modification of Neyman's (1934) method]: sample size proportional to the product of the cluster size times the mean Gower distance, and (iv) the LD method [a modification of Neyman's (1934) method]: sample size proportional to the product of the logarithm of the cluster size times the mean Gower distance. Five hundred independent stratified random samples under two sampling intensities, 10 and 20%, were obtained from three maize (Zea mays L.) collections and one maize population to compare the ability of the four allocation methods to retain the diversity of the collections.


    MATERIALS AND METHODS
The Gower Distance
Gower (1971) proposed a similarity measure between the ith and the jth individuals, $s_{ij}$, that can simultaneously use continuous, ordinal, binary, and nominal variables. The author showed that a sufficient condition for the distance $d_{ij} = (1 - s_{ij})^{1/2}$ between two individuals to be a Euclidean metric is that the similarity matrix $S = \{s_{ij}\}$ be positive semi-definite. In addition, the author showed that the similarity matrix S is positive semi-definite when there are no missing values in the data.

For p variables (k = 1, 2, ..., p), Gower's similarity measurement between two individuals i and j is:

$$s_{ij} = \frac{\sum_{k=1}^{p} w_{ijk}\, s_{ijk}}{\sum_{k=1}^{p} w_{ijk}}$$

where $w_{ijk}$ is a weight given to the ijkth comparison, taking the value 1 for valid comparisons and 0 for invalid comparisons (when the value of the variable is missing in one or both individuals); $s_{ijk}$ is the contribution of the kth variable to the total similarity between individuals i and j, and it takes values between 0 and 1. For a nominal variable, $s_{ijk} = 1$ if the value of the kth variable is the same for both individuals, i and j; otherwise, $s_{ijk} = 0$. For a continuous variable, $s_{ijk} = 1 - |x_{ik} - x_{jk}|/R_k$, where $x_{ik}$ and $x_{jk}$ are the values of the kth variable for individuals i and j, respectively, and $R_k$ is the range (maximum value minus minimum value) of the kth variable in the sample. The division by $R_k$ eliminates scale differences among variables, producing a value within the [0,1] interval and equal weights. The similarity value for binary characters is equal to the proportion of characters for which the two individuals agree, excluding the absence–absence agreement.

The Gower distance can be used as a diversity measure for a set of individuals (genotypes, accessions, etc.), with the important advantage that all types of variables can be used. Two genotypes with distances near zero show low diversity, whereas values near 1 indicate very diverse individuals.
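The definition above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the `kinds` labels, and the use of `None` for missing values are hypothetical conventions, not part of the original study, which used SAS.

```python
import math

def gower_similarity(x, y, kinds, ranges):
    """Gower (1971) similarity between two records x and y.

    kinds[k] is "continuous", "nominal", or "binary" (hypothetical labels);
    ranges[k] is the range R_k of a continuous variable (ignored otherwise).
    None marks a missing value.
    """
    num = 0.0  # running sum of w_ijk * s_ijk
    den = 0.0  # running sum of w_ijk
    for xk, yk, kind, rk in zip(x, y, kinds, ranges):
        if xk is None or yk is None:
            continue  # invalid comparison: w_ijk = 0
        if kind == "continuous":
            s = 1.0 - abs(xk - yk) / rk
        elif kind == "nominal":
            s = 1.0 if xk == yk else 0.0
        else:  # binary: absence-absence agreement is excluded (w_ijk = 0)
            if xk == 0 and yk == 0:
                continue
            s = 1.0 if xk == yk else 0.0
        num += s
        den += 1.0
    return num / den if den > 0 else float("nan")

def gower_distance(x, y, kinds, ranges):
    """Distance d_ij = (1 - s_ij)^(1/2)."""
    return math.sqrt(1.0 - gower_similarity(x, y, kinds, ranges))

# Two made-up accessions: days to anthesis, kernel color, a binary trait.
kinds = ("continuous", "nominal", "binary")
ranges = (40.0, None, None)
a = (70.0, "yellow", 1)
b = (80.0, "yellow", 0)
similarity = gower_similarity(a, b, kinds, ranges)  # (0.75 + 1 + 0) / 3
distance = gower_distance(a, b, kinds, ranges)
```

Averaging `gower_distance` over all pairs of accessions in a set gives the mean Gower distance used as the diversity measure.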

The D Allocation Method
The D allocation method proposed in this study determines that the size of the sample to be drawn from each cluster should be proportional to the mean Gower distance between individuals within that cluster. Therefore, the number of accessions selected from each cluster will be proportional to the within-group diversity measured as the mean Gower distance between accessions within that group. More diverse groups will have a larger mean Gower's distance and therefore larger samples will have to be drawn from them.

For t = 1, 2, ..., g clusters, the number of accessions to be drawn from the tth cluster ($n_t$) is

$$n_t = n\,p_t = n\,\frac{\bar{D}_t}{\sum_{t=1}^{g} \bar{D}_t}$$ [1]

where n is the total sample size to be drawn from the collection (which in this study is 10% or 20% of the entire collection), $p_t$ is the proportion of the sample size to be drawn from the tth cluster, and $\bar{D}_t$ is the mean Gower distance between accessions within the tth cluster.
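A minimal Python sketch of the D allocation follows. The text does not say how fractional sample sizes were rounded, so the largest-remainder rounding used here is an assumption, as is the function name `d_allocation`; the cap at the cluster size reflects the fact that a cluster cannot contribute more accessions than it contains. The three clusters are made-up values.

```python
import math

def d_allocation(mean_dists, sizes, n):
    """Sample size per cluster, proportional to the mean within-cluster distance.

    Assumptions of this sketch: fractional sizes are rounded with the
    largest-remainder rule, and a cluster never contributes more accessions
    than it contains (no redistribution of the shortfall).
    """
    total = sum(mean_dists)
    raw = [n * d / total for d in mean_dists]   # n * p_t before rounding
    alloc = [math.floor(r) for r in raw]
    leftover = n - sum(alloc)                   # units lost to flooring
    by_remainder = sorted(range(len(raw)),
                          key=lambda t: raw[t] - alloc[t], reverse=True)
    for t in by_remainder[:leftover]:
        alloc[t] += 1
    return [min(a, size) for a, size in zip(alloc, sizes)]

# Hypothetical clusters: one large and uniform, two small and diverse.
example = d_allocation([0.25, 0.47, 0.48], [450, 29, 17], n=20)  # -> [4, 8, 8]
```

The large but uniform cluster receives the smallest share, because only the within-cluster diversity enters the weights.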

The L Allocation Method
The L allocation method uses the logarithm of the size of the tth cluster ($N_t$) to obtain the sample size of the tth cluster ($n_t$)

$$n_t = n\,\frac{\log(N_t)}{\sum_{t=1}^{g} \log(N_t)}$$ [2]

with n as the total sample size (10 or 20% of the collection). The L method was proposed by Brown (1989) and later used by Yonezawa et al. (1995), Chandra et al. (2002), and Zichao et al. (2002).

The NY and LD Allocation Methods
Neyman (1934) proposed an optimal allocation method for estimating, with minimum variance, the mean value of the variables in each cluster via stratified samples. The method determines that the size of the sample to be drawn from each cluster is proportional to the cluster size ($N_t$) and the standard deviation of the variable of interest ($S_t$), such that $n_t = n \times \frac{N_t S_t}{\sum_{t=1}^{g} N_t S_t}$. It recovers as much of the diversity present in the collection as possible by using the standard deviation of the variables in the cluster as the diversity measurement.

To make the Neyman (1934) optimal allocation method comparable with the other allocation methods, it was modified in two ways. First, the size of the tth cluster ($N_t$) was weighted by the diversity measured as the mean Gower distance ($\bar{D}_t$). This allocation method was named the NY method and is represented by

$$n_t = n\,\frac{N_t\,\bar{D}_t}{\sum_{t=1}^{g} N_t\,\bar{D}_t}$$ [3]

Second, to smooth out the effect of cluster size, the logarithm of $N_t$ was weighted by the diversity of the tth cluster measured as the mean Gower distance ($\bar{D}_t$). This method was named the LD method

$$n_t = n\,\frac{\log(N_t)\,\bar{D}_t}{\sum_{t=1}^{g} \log(N_t)\,\bar{D}_t}$$ [4]
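To see how the four weighting schemes differ, the sketch below computes the proportion of the sample each method would assign to each cluster, before any rounding. The function name and the three clusters are hypothetical; only the weights (mean distance, log size, size times distance, log size times distance) come from the text. The base of the logarithm does not matter, since a constant factor cancels in the ratio.

```python
import math

def allocation_proportions(sizes, mean_dists):
    """Proportion p_t of the sample assigned to each cluster by each method."""
    weights = {
        "D": list(mean_dists),                                       # D_t
        "L": [math.log(N) for N in sizes],                           # log(N_t)
        "NY": [N * d for N, d in zip(sizes, mean_dists)],            # N_t * D_t
        "LD": [math.log(N) * d for N, d in zip(sizes, mean_dists)],  # log(N_t) * D_t
    }
    return {m: [w / sum(ws) for w in ws] for m, ws in weights.items()}

# Hypothetical clusters: one large and uniform, two small and diverse.
sizes = [450, 29, 17]
mean_dists = [0.25, 0.47, 0.48]
props = allocation_proportions(sizes, mean_dists)
share_of_large_cluster = {m: round(p[0], 3) for m, p in props.items()}
```

For these made-up values the large, uniform cluster takes the biggest share under NY, then L, then LD, and the smallest under D, which illustrates how strongly cluster size drives each method.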

The Ward-MLM Sequential Clustering Strategy
The initial groups formed by any hierarchical (geometric) clustering technique are based on the principle that rules that technique; for example, minimum variance within groups when the initial technique is the Ward method. Geometric clustering methods can be used with continuous and/or discrete variables by means of Gower's distance.

Statistical classification methods use the concept of mixture models. An initial classification of the individuals into g clusters is given so that each group is one of the distributions in the mixture. The vector with the mean of the traits and the variance–covariance matrix within clusters are estimated by the maximum-likelihood method. The maximization of the likelihood function begins at a point that has been reached using the geometric technique; it will then reach a peak (which could be local) near the starting point that contains the characteristics of the geometric technique.

The Modified Location Model is a mixture model developed by Franco et al. (1998) that uses continuous and discrete variables simultaneously. The Ward-MLM sequential clustering strategy forms the initial groups using the Ward method and then improves them by the MLM, the idea being that the MLM method will modify the groups initially formed by the Ward method, so that the final classification is a statistical one.

The Ward strategy is the recommended geometric clustering method to use in the two-stage clustering strategy because (i) the objective function of the Ward strategy is to minimize the variance within clusters, whereas the objective function of the mixture distribution model is to maximize the likelihood, of which the variance within a cluster is a component; (ii) the direct relationship between the Ward strategy and the multivariate analysis of variance technique is based on the result that the total variance is equal to the variance between clusters plus the variance within clusters; and (iii) the objective function of the Ward strategy tends to produce spherical clusters, whereas the mixture distribution model allows the formation of clusters of other shapes. Thus, the sequential clustering strategy allows the MLM to modify the form of the initial groups obtained by the Ward strategy to one that permits the formation of more homogeneous groups.

Determining the Number of Clusters in the Ward-MLM Method
The number of groups was determined in two steps. First, the pseudo-F criterion (SAS Institute, 2000) was used, for which the following ratio is computed for each division into g groups:

$$F = \frac{\mathrm{tr}(B)/(g-1)}{\mathrm{tr}(W)/(N-g)}$$

where tr(B) and tr(W) are the traces of the matrices of the sums of squares and cross products between and within groups, respectively, and N is the number of accessions. The number of groups, g, is selected as the one that gives the maximum value of this ratio.

Then, we used the graph of the likelihood profile (related to the likelihood ratio test) for different values of g near the value obtained by the pseudo-F, and observed the maximum growth point of the likelihood profile as a criterion for determining the definitive number of groups. The optimal number of groups was then determined using the pseudo-F approach combined with the log-likelihood profile.
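As an illustration of the first step, the pseudo-F ratio [tr(B)/(g − 1)] / [tr(W)/(N − g)] can be computed directly from continuous data and a cluster assignment. The sketch below is plain Python with a hypothetical function name and toy data; it relies on the identity that the total sum of squares equals the between-group plus the within-group sum of squares. The study itself used SAS, and the likelihood-profile step is not reproduced here.

```python
def pseudo_f(rows, labels):
    """Pseudo-F ratio for a partition of continuous data.

    rows: list of equal-length numeric tuples (one per accession)
    labels: cluster label of each row
    """
    n = len(rows)
    p = len(rows[0])
    groups = {}
    for row, lab in zip(rows, labels):
        groups.setdefault(lab, []).append(row)
    g = len(groups)

    def centroid(rs):
        return [sum(r[k] for r in rs) / len(rs) for k in range(p)]

    def sum_sq(rs, c):
        return sum((r[k] - c[k]) ** 2 for r in rs for k in range(p))

    tr_t = sum_sq(rows, centroid(rows))                      # total
    tr_w = sum(sum_sq(rs, centroid(rs)) for rs in groups.values())  # within
    tr_b = tr_t - tr_w                                       # between
    return (tr_b / (g - 1)) / (tr_w / (n - g))

# Toy data: two well-separated groups on one variable.
toy_rows = [(0.0,), (2.0,), (10.0,), (12.0,)]
toy_labels = [0, 0, 1, 1]
f_value = pseudo_f(toy_rows, toy_labels)  # tr(B) = 100, tr(W) = 4 -> 50.0
```

In practice the ratio would be evaluated for several candidate values of g and the partition with the largest value retained as the starting point for the likelihood profile.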

Datasets
In this study, three collections having different sizes (N), different values of diversity, and different numbers of clusters (g) were used (Taba et al., 1999). The Guatemalan collection had N = 100 accessions, and the Ward-MLM strategy formed g = 5 clusters. The Brazilian collection comprised N = 652 accessions, and the Ward-MLM strategy formed g = 13 clusters. The collection from Mexico had N = 1460 accessions, and g = 17 clusters were formed (Table 1). These datasets contained five continuous variables (days to anthesis, days to silking, plant and ear height, and grain moisture), two nominal variables (kernel color and texture), and two binary variables [number of ears per plant, coded 0 when less than or equal to 1 and 1 when more than 1; ear quality rating (1–9), coded 0 when less than or equal to 4.5 and 1 when more than 4.5].


Table 1. Collection, number of accessions in the collection (N), number of clusters found by the Ward-MLM strategy (g), mean Gower distance between the N accessions of the entire collection ($\bar{D}_c$), mean Gower distance between accessions within clusters ($\bar{D}_t$).

 
Another dataset, Pool 25 (Taba et al., 2001), with more variables than the other three, was also included (N = 210, g = 7) (Table 1). Pool 25 is a late tropical, yellow flint CIMMYT maize gene pool that comprises S2 lines crossed with a tester so that the entries should be very uniform. The 12 continuous variables were days to anthesis and silking, plant and ear height, days to senescence, grain moisture at harvest, shelling percentage, ear length and diameter, kernel row number by ear, and kernel length and width; the four binary variables were ear rot (0 = low, 1 = high), ear appearance (0 = bad, 1 = good), foliar disease score (0 = low, 1 = high), and agronomic scale (0 = bad, 1 = good).

Independent Stratified Random Samples
The allocation methods define how many, but not which specific, accessions per cluster should be sampled. The proposed D allocation method was evaluated and compared with the L, LD, and NY allocation methods by randomly drawing 500 samples from three maize collections and one maize gene pool. First, accessions from each of the four datasets were classified by Ward-MLM. Second, from each classified dataset, 500 independent stratified random samples (without replacement) were drawn for each of the factorial combinations of two sampling intensities (10 and 20% of the entire collection) and the four allocation methods (D, L, LD, and NY). This was done with the SURVEYSELECT procedure of SAS (SAS Institute, 2000) and code written in the IML procedure of SAS (SAS Institute, 2000). In each of the 500 samples, the accessions within each cluster were selected at random. Values were then computed for the criteria used to compare the four allocation methods (see below).
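The drawing step can be sketched as follows. The study used SAS (SURVEYSELECT and IML); this Python version is only an illustrative stand-in, and the cluster contents, allocation, seed, and function name are all made up.

```python
import random

def stratified_sample(clusters, alloc, rng):
    """Draw alloc[c] accessions without replacement from each cluster c.

    clusters: dict mapping cluster id -> list of accession ids
    alloc: dict mapping cluster id -> requested sample size n_t
    A cluster smaller than its requested size is included whole.
    """
    sample = []
    for cid, members in clusters.items():
        n_t = min(alloc[cid], len(members))
        sample.extend(rng.sample(members, n_t))
    return sample

# Hypothetical classified collection and allocation.
clusters = {
    "A": list(range(0, 50)),
    "B": list(range(50, 60)),
    "C": list(range(60, 65)),
}
alloc = {"A": 4, "B": 3, "C": 8}   # cluster C has only 5 accessions

rng = random.Random(2005)          # fixed seed so the draws are repeatable
samples = [stratified_sample(clusters, alloc, rng) for _ in range(500)]
```

Each of the 500 samples can then be scored on the comparison criteria, giving a distribution of values per allocation method and sampling intensity.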

Criteria for Comparing the Allocation Methods
A sampling strategy aims (i) to define a sampling intensity and an allocation method that will retain in the sample most of the collection diversity and (ii) to produce a sample with maximum variance and maximum distance between accessions, as compared with the variance and distances between accessions in the entire collection. The criteria we used for comparing the D method with the L, LD, and NY methods are described as follows.

Diversity of the Sample
The best allocation method is the one that produces a sample with a greater mean Gower distance among accessions ($\bar{D}_s$). For allocation methods, sampling intensities, and allocation method × sampling intensity interactions, the mean Gower distances across 500 independent random samples were statistically compared.

Recovery of the Range in the Sample
The recovery of the range (RR) for all variables (discrete and continuous) is given by $RR = \frac{1}{p}\sum_{k=1}^{p} \frac{R_{nk}}{R_{Nk}}$, where $R_{nk}$ and $R_{Nk}$ are the ranges of the kth variable in the sample and in the entire collection, respectively, for k = 1, 2, ..., p variables. An allocation method is better if it selects a sample with an RR near 1. The mean recovery of the range ($\overline{RR}$) values for allocation methods, sampling intensities, and allocation method × sampling intensity interactions were also statistically compared.
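A short sketch of the recovery of the range, assuming RR is the average over the p variables of the ratio between the range in the sample and the range in the collection (so that a perfect recovery gives 1). The function name and the toy data are hypothetical, variables are assumed to be numerically coded, and a variable that is constant in the collection would need separate handling because its range is zero.

```python
def recovery_of_range(sample, collection):
    """Mean over variables of (range in sample) / (range in collection)."""
    p = len(collection[0])
    total = 0.0
    for k in range(p):
        coll_k = [row[k] for row in collection]
        samp_k = [row[k] for row in sample]
        total += (max(samp_k) - min(samp_k)) / (max(coll_k) - min(coll_k))
    return total / p

# Toy collection with two variables and a two-accession sample.
collection = [(0.0, 10.0), (5.0, 20.0), (10.0, 30.0)]
sample = [(0.0, 10.0), (5.0, 30.0)]
rr = recovery_of_range(sample, collection)  # (5/10 + 20/20) / 2 = 0.75
```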

Variances of the Samples
An optimal allocation method should produce samples with high variance among the accessions. The variance of the accessions in the sample was measured for the five continuous variables: days to anthesis (DA), days to silking (DS), plant height (PH), ear height (EH), and grain moisture (GM). Thus, differences in the mean variances of each continuous variable, $\bar{S}^2_{DA}$, $\bar{S}^2_{DS}$, $\bar{S}^2_{PH}$, $\bar{S}^2_{EH}$, and $\bar{S}^2_{GM}$, for allocation methods, sampling intensities, and allocation method × sampling intensity interactions were statistically assessed.

Comparing Allocation Methods
Analyses of variance for each dataset considered the allocation method, the sampling intensity, and the allocation method × sampling intensity interaction as fixed effects. Comparisons between allocation methods were performed across sampling intensities and within each sampling intensity. The dependent variables were the criteria used to evaluate the allocation methods: diversity of the sample measured by the mean Gower distance among accessions in the sample ($\bar{D}_s$), the mean recovery of the range in the sample ($\overline{RR}$), and the mean variance of the sample for the five continuous variables DA, DS, PH, EH, and GM ($\bar{S}^2_{DA}$, $\bar{S}^2_{DS}$, $\bar{S}^2_{PH}$, $\bar{S}^2_{EH}$, and $\bar{S}^2_{GM}$, respectively).

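Pairwise comparisons of allocation methods across sampling intensities and within each sampling intensity were made for the mean Gower distance in the sample ($\bar{D}_s$), the mean recovery of the range ($\overline{RR}$), and the five mean variances ($\bar{S}^2_{DA}$, $\bar{S}^2_{DS}$, $\bar{S}^2_{PH}$, $\bar{S}^2_{EH}$, and $\bar{S}^2_{GM}$) using Tukey's studentized range test.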

Ranking the Allocation Methods
The Friedman two-way test (Conover, 1971) was performed, within each sampling intensity, for testing the null hypothesis that the four allocation methods rank equally across the comparison criteria.
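The ranking step can be illustrated with the classical Friedman statistic, treating each criterion as a block and each allocation method as a treatment. This is a sketch under assumptions: it uses the textbook chi-square form without tie correction (Conover describes variants), rank 1 goes to the smallest value within a block, and the function name and numbers are made up.

```python
def friedman(table):
    """Mean ranks and Friedman chi-square for a blocks-by-treatments table.

    table[i][j]: value of criterion i (block) for method j (treatment).
    Assumes no ties within a block; rank 1 is the smallest value.
    """
    b = len(table)       # number of blocks (criteria)
    k = len(table[0])    # number of treatments (allocation methods)
    rank_sums = [0] * k
    for row in table:
        order = sorted(range(k), key=lambda j: row[j])
        for rank, j in enumerate(order, start=1):
            rank_sums[j] += rank
    chi2 = 12.0 / (b * k * (k + 1)) * sum(r * r for r in rank_sums) \
        - 3.0 * b * (k + 1)
    mean_ranks = [r / b for r in rank_sums]
    return mean_ranks, chi2

# Made-up table: 3 criteria, 3 methods, identical ordering in every block.
mean_ranks, chi2 = friedman([[1.0, 2.0, 3.0],
                             [4.0, 5.0, 6.0],
                             [7.0, 8.0, 9.0]])
```

With identical orderings in every block the statistic reaches its maximum for that table size, b(k − 1), which is 6 in this toy case.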

Comparing Allocation Methods with the Entire Collection
On the basis of the criteria described above, we compared the four allocation methods with the entire collection in each of the 500 independent stratified random samples.

It is expected that the mean Gower distance between accessions in the sample is greater than that between accessions in the entire collection. This is because the sample, while preserving diversity, also has fewer redundant accessions. Thus, if the sample has a good representation of the diversity in the collection but fewer redundant accessions, its mean Gower distance will be greater than the mean Gower distance in the entire collection. If the mean Gower distance between accessions of the entire collection is $\bar{D}_c$, then a good performance criterion is that the mean Gower distance between the selected accessions forming the sample ($\bar{D}_s$) be greater than $\bar{D}_c + 0.1\bar{D}_c$, $\bar{D}_c + 0.2\bar{D}_c$, or $\bar{D}_c + 0.3\bar{D}_c$.
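The comparison against the entire collection amounts to counting, over the repeated samples, how often the sample's mean distance exceeds the collection's mean distance by 10, 20, or 30%. A minimal sketch follows; the function names, the toy distance, and the numbers are hypothetical.

```python
from itertools import combinations

def mean_pairwise(items, dist):
    """Mean of dist(a, b) over all unordered pairs of items."""
    pairs = list(combinations(items, 2))
    return sum(dist(a, b) for a, b in pairs) / len(pairs)

def count_exceeding(sample_means, collection_mean, margins=(0.1, 0.2, 0.3)):
    """Number of samples whose mean distance exceeds (1 + margin) * collection mean."""
    return {
        m: sum(1 for d in sample_means if d > collection_mean * (1 + m))
        for m in margins
    }

# Made-up mean distances for four samples and a collection mean of 0.44.
counts = count_exceeding([0.45, 0.50, 0.55, 0.60], 0.44)

# Mean pairwise distance on toy one-dimensional data (absolute difference).
toy_mean = mean_pairwise([0, 1, 3], lambda a, b: abs(a - b))  # (1 + 3 + 2) / 3
```

Any distance function can be passed as `dist`; in the study's setting it would be the Gower distance between two accessions.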

Concerning the recovery of the range (RR) of the variables in the sample, an allocation method is better if it selects a sample with high RR. Regarding the variances of the variables in the sample, a procedure is better if it produces samples with higher variances than the variance among accessions in the entire collection. We used three criteria, each requiring the sample variance to equal or exceed a threshold defined from the collection variance, where $S^2_S$ and $S^2_C$ are the variances for the sample and the entire collection, respectively, for each continuous variable. In the sampling study, the number of times that each criterion was met was recorded.


    RESULTS AND DISCUSSION
The Ward-MLM method produced clusters with smaller mean Gower distances between accessions within each cluster ($\bar{D}_t$) than the mean Gower distance between accessions in the entire collection ($\bar{D}_c$) for the four datasets (Table 1). The dataset from Mexico had the highest number of observations, the highest number of groups, and the lowest values for the within-cluster ($\bar{D}_t$ = 0.33) and total ($\bar{D}_c$ = 0.44) mean distances. The Guatemala dataset had the lowest number of observations and lowest number of groups, whereas the Brazil and Pool 25 datasets had the highest values for $\bar{D}_c$ and $\bar{D}_t$, respectively. The values of $\bar{D}_t$ for each individual cluster in all datasets were always smaller than the mean distance between accessions in the entire collection ($\bar{D}_c$), except for two clusters (3 and 5) in the Mexico collection (Table 2). When an allocation method requires a sample size larger than the size of the cluster, fewer accessions than requested are sampled. This is the case in the Mexico collection, where the D method selected fewer accessions from cluster 5 (17) than from clusters 2, 3, 9, 10, 15, and 17, even though cluster 5 had the greatest $\bar{D}_t$. These results indicate that the Ward-MLM sequential clustering strategy formed homogeneous groups.


Table 2. Sample size ($n_t$) for the four allocation methods (D, LD, NY, and L) for a 20% sampling intensity and four datasets. Number of clusters (g), number of observations per cluster ($N_t$), and mean Gower distance per cluster ($\bar{D}_t$).

 
Table 2 shows that, for Groups 4 and 5 from Mexico, the D and LD methods required a sample size equal to or larger than the group size because of the heterogeneity of the groups (high distance values) combined with a small group size. In these cases, the entire cluster was included. In Pool 25, the D method allocated the same number of accessions to all groups, and the mean Gower distances within clusters were very similar, ranging from 0.37 to 0.42 (Table 2). For Pool 25, the other methods did not allocate a similar number of accessions per cluster, as did the D method. These results are in agreement with the high uniformity of the entries comprising Pool 25.

In general, the NY method tends to allocate very different numbers of accessions to the groups. For example, in the Mexico collection the per-group sample size ranged from 3 to 73. In contrast, the D and LD methods produced per-group sample sizes that varied less: with the D method they ranged from 13 to 24, and with the LD method from 13 to 25.

The size of samples drawn from each cluster using the D allocation method is based on the diversity of the cluster ($\bar{D}_t$) and not on its size ($N_t$) (Table 2). For example, for the Mexico collection, Group 6 had $N_t$ = 450 accessions with the lowest diversity, $\bar{D}_t$ = 0.25; the D method allocated 13 accessions to this group, whereas the LD, NY, and L methods allocated 21, 73, and 28 accessions, respectively. On the other hand, Mexico Groups 3 and 5 had $N_t$ = 29 and $N_t$ = 17 accessions, respectively, and the two highest diversity values: $\bar{D}_t$ = 0.47 and $\bar{D}_t$ = 0.48, respectively; the D method allocated 24 and 17 accessions to Groups 3 and 5, respectively, but the other allocation methods assigned a smaller number of accessions to these clusters. Similarly, for the Brazil collection, Group 9 had $N_t$ = 106 and $\bar{D}_t$ = 0.24, and Group 13 had $N_t$ = 50 and $\bar{D}_t$ = 0.48; the D method assigned 6 accessions to Group 9 and 12 to Group 13.

Comparing Allocation Methods
Diversity of the Sample
The mean Gower distances between accessions across the 500 samples ($\bar{D}_s$) were higher than the respective mean Gower distance between accessions in the entire collection for the four datasets and for each allocation method–sampling intensity combination (Table 3). The minimum value of the 500 samples for all datasets and allocation methods was always larger than the mean Gower distance between accessions of the corresponding dataset. These results indicate that all allocation methods selected samples formed by a well-differentiated group of accessions.


Table 3. Mean Gower distance between the accessions of the sample ($\bar{D}_s$), mean recovery of the range in the sample ($\overline{RR}$), and mean of the variance for days to anthesis ($\bar{S}^2_{DA}$), days to silking ($\bar{S}^2_{DS}$), plant height ($\bar{S}^2_{PH}$), ear height ($\bar{S}^2_{EH}$), and grain moisture ($\bar{S}^2_{GM}$) for two sampling intensities (10 and 20%) and four allocation methods (D, LD, L, and NY) for four datasets, and for the entire collection (Coll.). Mean rank across $\bar{D}_s$, $\overline{RR}$, $\bar{S}^2_{DA}$, $\bar{S}^2_{DS}$, $\bar{S}^2_{PH}$, $\bar{S}^2_{EH}$, and $\bar{S}^2_{GM}$, and chi-square value for the Friedman test ($\chi^2$).

 
The analysis of variance showed that there were significant differences (P ≤ 0.01) between levels of allocation method, sampling intensity, allocation method × sampling intensity interaction, and allocation methods within sampling intensities (data not shown). For all datasets and both sampling intensities, Tukey's test indicated that $\bar{D}_s$ of the D method was always significantly higher (P ≤ 0.01) than $\bar{D}_s$ of the other allocation methods (Table 3). When combining the allocation methods across both sampling intensities, $\bar{D}_s$ of the D method was significantly superior to $\bar{D}_s$ of the other allocation methods for all datasets (data not shown). For all datasets, the $\bar{D}_s$ that the D method produced with a sampling intensity of 10% was significantly higher than the $\bar{D}_s$ of samples generated with a 20% sampling intensity (data not shown).

The distribution of the mean Gower distances (mean D) from 500 samples is shown as box plots in Fig. 1. The D method produced the highest values for all datasets and for both sampling intensities (10% and 20%). In general, a 10% sampling intensity generated samples with a higher mean Gower distance than the 20% sampling intensity, for all allocation methods and collections. Thus, for these datasets and this diversity criterion, a 20% sampling intensity resulted in redundant information, and the 10% sampling intensity was sufficient for representing the collection's diversity.



Fig. 1. Box plot representation of the mean Gower distance among accessions (meanD) for 500 samples using four allocation methods (D, LD, L, and NY) and two sampling intensities: 10%, and 20% (10D, 10LD, 10L, 10NY, 20D, 20LD, 20L, 20NY) for Mexico (1a), Brazil (1b), Pool 25 (1c), and Guatemala (1d) collections.

 
Recovery of the Range in the Sample
There were significant differences (P ≤ 0.01) between levels of allocation method, sampling intensity, allocation method × sampling intensity interaction, and allocation methods within sampling intensities in all datasets (data not shown). Tukey's test indicated that $\overline{RR}$ of the D method was always significantly higher (P ≤ 0.01) than $\overline{RR}$ of the other allocation methods (Table 3) at both sampling intensities in all datasets except Brazil. Averaged across sampling intensities, the D method had $\overline{RR}$ values significantly larger than those of the other allocation methods for all datasets except Brazil ($\overline{RR}$ of the D and L methods were similar). In all datasets, the $\overline{RR}$ for the 20% sampling intensity (across allocation methods) was significantly larger than the $\overline{RR}$ for the 10% sampling intensity (data not shown).

The distribution of the RR values from 500 samples is shown as box plots in Fig. 2. In general, a 20% sampling intensity generated samples with higher RR values than the 10% sampling intensity, for all allocation methods and collections.



Fig. 2. Box plot representation of the recovery of the range (RR) for 500 samples using four allocation methods (D, LD, L, and NY) and two sampling intensities, 10% and 20% (10D, 10LD, 10L, 10NY, 20D, 20LD, 20L, 20NY), for the Mexico (2a), Brazil (2b), Pool 25 (2c), and Guatemala (2d) collections.
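As a rough illustration of the RR criterion, the sketch below assumes that RR is the average, over continuous variables, of the ratio between the range observed in the sample and the range observed in the whole collection. That definition is an assumption made for this example (the exact formula belongs to the Materials and Methods), and the function name and toy data are hypothetical.

```python
def recovery_of_range(sample, collection):
    """Average ratio of sample range to collection range (assumed definition).

    sample and collection are lists of tuples, one tuple of continuous
    values per accession, with variables in the same order.
    """
    n_vars = len(collection[0])
    ratios = []
    for k in range(n_vars):
        coll_values = [row[k] for row in collection]
        samp_values = [row[k] for row in sample]
        coll_range = max(coll_values) - min(coll_values)
        if coll_range == 0:
            # A constant variable is trivially fully recovered.
            ratios.append(1.0)
        else:
            ratios.append((max(samp_values) - min(samp_values)) / coll_range)
    return sum(ratios) / n_vars


# Toy collection of four accessions and two continuous variables.
toy_collection = [(0.0, 10.0), (2.0, 20.0), (4.0, 30.0), (8.0, 50.0)]
toy_sample = [(2.0, 20.0), (8.0, 50.0)]
toy_rr = recovery_of_range(toy_sample, toy_collection)
```

Under this definition a larger sample can only keep or widen the recovered range of a nested smaller sample, which is in line with the higher RR values observed at 20% sampling intensity.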

 
Variances of the Samples
Not all effects (sampling intensity, the allocation method × sampling intensity interaction, and allocation methods nested within sampling intensities) were significant for the mean variances of the five continuous variables in every dataset. Only the allocation method effect was significant (P ≤ 0.01) in all datasets for the mean variances of all five variables. Tukey's test indicated that the values of , , , , and were significantly larger with the D method than with the other methods in most cases, except for: 1) in Guatemala, Brazil, and Pool 25 for both sampling intensities; 2) , , and in Brazil for 20% sampling intensity; 3) and in Pool 25 for 10% and 20% sampling intensities (Table 3).

The mean variances of the variables for all datasets and allocation methods tended to be larger for 10% sampling intensity than for 20% sampling intensity (Table 3). When the allocation methods are averaged across sampling intensities, the values of and for the D method were significantly larger than those of the other allocation methods (data not shown). For and the D method significantly differed from the other methods, except in Pool 25. For the D method differed from the others only in Mexico and Pool 25.

Ranking the Allocation Methods
The D allocation method ranked consistently first for the s and RR variables for all datasets and sampling intensities. The D method ranked first in most of the variances of the five continuous variables, except for in Guatemala under both sampling intensities and for in Brazil and Pool 25 under 20% sampling intensity. The mean rank of each allocation method in each dataset and sampling intensity is shown in Table 3. The Friedman test for each dataset and sampling intensity indicated that the data are consistent with the hypothesis that the D allocation method ranked consistently higher than the other allocation methods across all seven variables.
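The idea of a mean rank per allocation method can be sketched as follows. The numbers are made-up placeholders, not results from the study, and two conventions are assumptions of this example: rank 1 goes to the largest value, and tied values share the average of their ranks.

```python
def rank_descending(values):
    """Ranks with 1 = largest value; ties receive the average rank (assumed)."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the block while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        average_rank = (pos + end) / 2 + 1
        for j in range(pos, end + 1):
            ranks[order[j]] = average_rank
        pos = end + 1
    return ranks


def mean_ranks(table):
    """Mean rank of each method across response variables.

    table maps a variable name to a list of values, one per method,
    always in the same method order.
    """
    rows = [rank_descending(values) for values in table.values()]
    n_methods = len(rows[0])
    return [sum(row[m] for row in rows) / len(rows) for m in range(n_methods)]


# Hypothetical values for four methods (order: D, LD, L, NY) on two variables.
toy_table = {
    "var1": [0.9, 0.7, 0.6, 0.5],
    "var2": [0.8, 0.8, 0.4, 0.2],
}
toy_mean_ranks = mean_ranks(toy_table)
```

A Friedman test would then ask whether such mean ranks differ more than expected by chance across the response variables.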

Comparing Allocation Methods with the Entire Collection
Diversity of the Sample
Across datasets and sampling intensities, the D allocation method produced a larger percentage of samples with s ≥ than the other allocation methods at both sampling intensities (Table 4). For the interval the D method was superior to the other methods only in Mexico (at both sampling intensities) and in Guatemala with 10% sampling intensity.


Table 4. Percentage of the 500 samples showing a mean Gower distance between accessions (s) greater than , and (c = mean Gower distance between accessions of the entire collection) for two sampling intensities (10% and 20%), four allocation methods (D, LD, L, and NY), and four datasets. Percentage of samples showing a recovery of the range (RR) greater than 0.80 (RR80) and 0.90 (RR90).

 
Recovery of the Range in the Sample
The D method produced the same or a higher number of samples that recovered 80% (RR80) and 90% (RR90) of the range of variables included in the analysis than were produced by the other allocation methods, for all datasets and sampling intensities (Table 4). The exception was the Brazil collection with 10% sampling intensity, where the D method recovered 90% of the range in only 34% of the 500 samples, as compared with the NY method, which recovered 90% of the range in all 500 samples (Table 4).
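The RR80 and RR90 columns of Table 4 are simple exceedance percentages. A minimal sketch, using a hypothetical helper name and made-up RR values rather than data from the study:

```python
def percent_exceeding(values, threshold):
    """Percentage of samples whose criterion value is strictly above threshold."""
    return 100.0 * sum(1 for v in values if v > threshold) / len(values)


# Made-up RR values for four samples; the study used 500 samples per setting.
toy_rr_values = [0.95, 0.85, 0.79, 0.91]
rr80 = percent_exceeding(toy_rr_values, 0.80)
rr90 = percent_exceeding(toy_rr_values, 0.90)
```

The same helper could be applied to the mean Gower distance of each sample, with a threshold derived from the mean distance of the entire collection.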

Variances of the Samples
For all datasets and sampling intensities, the D method resulted in the highest percentage of the 500 samples with variances among the accessions in the sample (S2S) that were greater than the values for , , and (data not shown). The only exception was for the variable GM in the Guatemala collection. It is interesting that for all datasets, the D method tended to generate more diverse samples than the other methods as the width of the interval increased from 10% to 50%. These results indicate that the D method produced samples with maximum variance and maximum distance between accessions as compared with the variance and the distances between accessions in the entire collection.


    CONCLUSIONS
 
This research proposes the D allocation method and compares it with other allocation methods, with the objective of forming core subsets that capture, and therefore represent, most of the diversity existing in the original collection. The D allocation method seems to be effective in structuring samples that preserve the diversity of the original collection. In the three collections and Pool 25, and with both sampling intensities, the D method resulted in significantly larger mean Gower distances between accessions in the samples than were obtained with the other allocation methods. Results indicated that the D allocation method recovered significantly more of the range of variables in the sample than did the other allocation methods. In general, the D method generated samples with significantly larger variance than the other methods; the exception was grain moisture.

In most cases, for the response variables s, , , , , , and , the D allocation method ranked first. The mean rank of the D allocation method was statistically higher than the mean rank of the other allocation methods. Concerning the sampling intensities, the results of this study indicated that for s, , , , , a sample of 10% of the entire collection is sufficient for preserving the diversity of the collection, whereas results based on RR showed that a sampling intensity of 20% preserves more of the diversity.

In this study, accessions from each cluster were randomly selected according to the sample size determined by the four allocation methods. However, allocation methods do not define which specific accessions should be sampled. Accessions can be selected from each cluster on the basis of other criteria as well, such as general agronomic performance, grain yield, and general plant type. Some researchers may decide to select the best performing accessions to be crossed with line testers or elite germplasm sources, and then initiate a prebreeding program. For example, the D method can be combined with an agronomic selection criterion for selecting accessions from each cluster.
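The four allocation rules compared in this study can be sketched as proportional allocations that differ only in the weight given to each cluster: mean Gower distance (D), logarithm of cluster size (L), logarithm of cluster size times mean distance (LD), and cluster size times mean distance (NY). The rounding scheme (largest remainder), the function names, and the toy clusters are assumptions of this example. The sketch does not cap a cluster's allocation at its size, and under the L and LD rules a cluster with a single accession receives zero weight.

```python
import math


def allocation_weights(sizes, mean_dists, method):
    """Per-cluster weights for the D, L, LD, and NY allocation rules."""
    if method == "D":
        return list(mean_dists)
    if method == "L":
        return [math.log(n) for n in sizes]
    if method == "LD":
        return [math.log(n) * d for n, d in zip(sizes, mean_dists)]
    if method == "NY":
        return [n * d for n, d in zip(sizes, mean_dists)]
    raise ValueError(f"unknown allocation method: {method}")


def allocate(sizes, mean_dists, method, intensity):
    """Number of accessions to draw per cluster at a given sampling intensity.

    Shares are proportional to the weights; integer counts are obtained
    with largest-remainder rounding (an assumption of this sketch).
    """
    total = round(intensity * sum(sizes))
    weights = allocation_weights(sizes, mean_dists, method)
    weight_sum = sum(weights)
    shares = [total * w / weight_sum for w in weights]
    counts = [math.floor(s) for s in shares]
    leftover = total - sum(counts)
    # Hand the remaining units to the clusters with the largest fractions.
    by_fraction = sorted(
        range(len(shares)), key=lambda i: shares[i] - counts[i], reverse=True
    )
    for i in by_fraction[:leftover]:
        counts[i] += 1
    return counts


# Toy collection: three clusters with hypothetical sizes and mean distances.
toy_sizes = [100, 50, 50]
toy_mean_dists = [0.1, 0.2, 0.3]
d_counts = allocate(toy_sizes, toy_mean_dists, "D", 0.10)
ny_counts = allocate(toy_sizes, toy_mean_dists, "NY", 0.10)
```

In this toy case the D rule assigns the most accessions to the most heterogeneous cluster regardless of its size, whereas the NY rule pulls part of the allocation back toward the largest cluster. Which accessions are then taken from each cluster (at random or by an agronomic criterion) is a separate decision.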

The D method can be used with any clustering strategy and any distance measure. In this study the clustering strategy was the Ward-MLM used with continuous and discrete variables; the only distance that can be used for such data is Gower's distance, which is thus the distance that should be used in the D allocation method. The D method may be useful not only for sampling genetic diversity in crop germplasm collections but also in other areas of research where a stratified sampling method is required for preserving as much of the original population's diversity as possible.

The Ward-MLM strategy can use phenotypic and genetic marker data simultaneously, as shown by Franco et al. (2001). Using only molecular markers and/or DNA sequence data, various genetic distances and hierarchical clustering algorithms can be employed, and various allocation methods evaluated. Results can be validated based on phenotypic data, as was done by McKhann et al. (2004). However, further research is needed to assess the usefulness of the D allocation method using only marker data and to compare it with other allocation methods that do not use stratified sampling such as the M (maximization) strategy proposed by Schoen and Brown (1993).

Received for publication May 12, 2004.

