|
|
||||||||
a Facultad de Agronomía, Univ. de la República Oriental del Uruguay, Garzón 780, Montevideo, Uruguay
b Maize Genetic Resources Unit, CIMMYT, Apdo. Postal 6-641, 06600 Mexico DF, México
c National Seed Storage Laboratory, USDA-ARS, Fort Collins, CO 80523
* Corresponding author (j.crossa{at}cgiar.org)
| ABSTRACT |
|---|
|
|
|---|
Abbreviations: EM, expectation maximization GM, Gaussian model HET, heterogeneous HOM, homogeneous LM, location model MLM, modified location model SP, subpopulation
| INTRODUCTION |
|---|
|
|
|---|
In genetic resources conservation and the formation of core subsets, the main objective is to select accessions that best represent the entire collection with the minimum loss of genetic diversity. Therefore, the best numerical classification strategy is the one that produces the most compact and well separated groups (i.e., minimum variability within each group and maximum variability among groups).
The location model (LM) proposed by Lawrence and Krzanowski (1996) classifies individuals into g HOM groups using categorical and continuous variables. The model combines the levels of the discrete variables in one unique multinomial variable, W, with m values. Franco et al. (1998) proposed the MLM in the context of a two-stage clustering strategy in which initial groups are first defined using a hierarchical clustering method [such as Ward or Unweighted Pair Grouping with Arithmetic Means (UPGMA)] and then the MLM is applied with the purpose of improving those groups.
The MLM model makes two important assumptions. First, the multinomial variable, W, is independent from the vector of continuous variables. Second, the underlying SPs can have restricted HOM or unrestricted HET variancecovariance matrices. In other words, the correlations (between attributes) and variances (of the attributes) can be assumed to be equal (HOM) or to be different (HET) across SPs. The accompanying article of Franco and Crossa (2002) demonstrated, using different simulated scenarios, that the MLM is very robust when the independence between the variable W and the vector of continuous variables does not hold even when high overlapping occurs between SPs on the values of the W variable and on the continuous variables. Franco and Crossa (2002) showed that the true number of SPs is accurately estimated by the Ward-MLM strategy and that the SPs are fully recovered in most cases, even when strong dependence holds.
The assumption of heterogeneity or homogeneity variancecovariance matrices across SPs affects the number of parameters to be estimated, and it may be a limitation when a large number of attributes are measured. For p-continuous attributes measured on each individual, the number of parameters to be estimated under the homogeneity assumption is g(1 + m + p) + p(p + 1)/2, and g[1 + m + p + p(p + 1)/2] under the heterogeneity assumption (for g SP proportions, gm multinomial cell proportions, gp SP means, and p(p + 1)/2 and gp(p + 1)/2 variances and covariances, respectively). In other words, the MLM method under heterogeneity of variancecovariance matrices estimates (g - 1)[p(p + 1)/2] more parameters than under the homogeneity assumption; when this number is large the heterogeneity model cannot be used.
In practical situations, however, it seems realistic to assume that covariances between some pairs of attributes and variances of attributes change, depending on the subset of accessions (i.e., SPs) under consideration. For example, subsets of accessions with differential tolerance/susceptibility to a specific disease might show varying associations between that disease and grain yield. Jorgensen and Hunt (1996) and McLachlan and Basford (1988) pointed out some difficulties for highly parameterized models such as those related to the fact that the likelihood function of a mixture model can present several singularities, and that many local maximums can be found.
Similar to using the MLM for testing the effect of the dependence between the W variable and the vector of continuous variables (Franco and Crossa, 2002), the effect of using the MLM assuming homogeneity or heterogeneity for classifying a hypothetical data set having SPs with HOM or HET variancecovariance matrices can only be tested using simulated experimental scenarios with a known structure. In real data sets, the structure of the underlying SPs is unknown, as are the variances of each attribute and the associations between attributes across SPs. In this case (and only if the relation between the number of observations and the number of estimated parameters allows to use the HET model) the researcher can compare the resulting clusters under the homogeneity and heterogeneity models and then select the best classification based on appropriate numerical and statistical criteria such as the maximum distance between groups, the minimum variance within groups, and the probability of membership of each observation into each group.
The objectives of this study were (i) to compare the effect of using the Ward-MLM strategy assuming among-group homogeneity with the effect of assuming among-group heterogeneity of variancecovariance matrices, on the classification of two simulated data sets with known structures and either HOM or HET variancecovariance matrices; and (ii) to compare the classifications obtained assuming homogeneity with those obtained assuming heterogeneity of variancecovariance matrices across SPs using data from maize accessions from nine countries (Taba et al., 1999).
| MATERIALS AND METHODS |
|---|
|
|
|---|
The Modified Location Model
Assume a random sample of n individuals from a mixture of a certain number of unknown SPs. Let the n vectors, each of size p + q, be the observations of p-continuous and q-categorical variables on each of the n individuals. The MLM combines the levels of each of the q (k = 1,..., q) categorical variables into one multinomial variable, W, with values ws = 1, 2,... m (m being the maximum number of possible levels or multinomial cells). Each observation can be written as an 1 x (p + 1) vector x'sj =
=
, where for j = 1,..., n observations, ysjk is one of the p-continuous variables, and ws the multinomial value or multinomial cell. Note that the extra index s is used for pointing out that the LM is conditioned on the multinomial cell s.
The likelihood function corresponding to the matrix of the entire sample data Xnx(p+1) is
![]() |
contains the parameters of the model;
i (i = 1,..., g) is the proportion of observations into each SP (cluster) of the mixture; pis is the proportion of observations into the sth multinomial cell of the ith SP;
is the common variancecovariance matrix across SPs, and µi is the vector of means of the ith SP. The X matrix is considered incomplete in the sense that the true original SP for which each observation belongs is unknown. Therefore, a solution is obtained using the EM algorithm (Dempster et al., 1977). The matrix is completed when each observation is assigned to each SP. Then, the log likelihood for the complete data matrix is obtained using the initial Ward groups as the starting point for the EM algorithm. The maximum likelihood estimates of the parameters are found. The formed groups maximize the log-likelihood of the observations and the probability of membership of each observation into each group is obtained.
Assuming heterogeneity of variancecovariance matrices across groups, the maximum likelihood estimate of the variancecovariance is
![]() |
isj is the probability of membership of each observation in the ith SP. Assuming homogeneity of variancecovariance matrices across groups, its maximum likelihood estimate is given by
![]() |
The probability of membership for each observation belonging to the ith SP assuming heterogeneity of variancecovariance is estimated as
![]() |
When homogeneity of variances-covariances is assumed,
i should be replaced by
.
The heterogeneity assumption implies an increase in the number of parameters to be estimated. This is a limitation of the model. Also, when the size of the cluster ni =
in is lower than p + 1 (p being the number of continuous variables), the within-cluster variancecovariance matrix is singular and thus the maximum likelihood estimators of the probability of membership cannot be computed. This imposes a lower bound on the cluster size of ni
(p + 1). On the other hand, when the variancecovariance matrices are assumed to be HOM, a unique pooled variancecovariance matrix is used, and therefore the required lower bound for having a nonsingular variancecovariance matrix is n
(p + 1) which, in general, does not impose any problem for estimating the parameters.
Two-stage Ward-Modified Location Model Method
The two-stage Ward-MLM strategy was proposed by Franco et al. (1998). The underlying idea is that the initial groups formed by the Ward method, used as the starting point by the MLM model, will allow a better approximation to some maximum (global or local) because this starting point has the property implied in the Ward method; that is, to minimize the within-group sum of squares.
As pointed out by Franco et al. (1998), when the total number of cells m x g is very large, the Ward grouping will not be improved by the MLM because the observations are spread out widely across the cells, and the final classification will be the same as the Ward classification.
Estimation of the Optimal Number of Subpopulations
The optimal number of SPs is determined using the upper tail approach (Wishart, 1987) combined with the likelihood ratio test (Mardia et al., 1979), and the log-likelihood profile (Franco et al., 1998). For a specific simulated scenario, in order to make results comparable, the same number of initial groups was used for the Ward-MLM strategy, assuming homogeneity and heterogeneity of variancecovariance matrices.
Measuring Distances Between Groups for the Continuous Variables
For comparing the results under homogeneity and heterogeneity of variancecovariance matrices, the average,
2, of the Mahalanobis (1930) distances, D2ij, between each pair of groups (i, j) was used. When the HOM variancecovariance matrix is assumed, the Mahalanobis distance is D2ij = D2ji =
'
-1
. When HET variancecovariances matrices is assumed, D2ij =
'
-1i
, D2ji =
'
-1j
, and D2ij
D2ji. However, the average of the average distances between groups based on the D2ij
is equal to the average of the average distance between groups based on D2ji
, and both are equal to
2. Thus
2 estimated assuming heterogeneity can be compared directly with
2 estimated assuming homogeneity.
Measuring Distances Between Groups for the Multinomial Variable
Krzanowski (1983) proposed measurements of affinity and distance between groups for the LM model. For the categorical variables, the author uses
ms=1 (pis pjs)1/2 as a measure of affinity between any pair of groups (i, j), where pis and pjs denote the proportion of cases with the sth value (s = 1,..., m, the multinomial cell) in the i and j groups, respectively. We used the average,
, of the distances dij = 1 -
ms=1 (pis pjs)1/2 between every pair of groups as a measure of distance corresponding to the discrete part of the model.
Measuring the Quality of a Classification
For the GM, McLachlan and Basford (1988) proposed, as a measure of the quality of a classification, the average of the maximum of the probabilities of membership,
![]() |
Because an observation is assigned to a group with maximum probability of membership, it is expected that well-classified individuals will have a high probability of membership.
Measuring the Recovery of the True Clustering Structure
When the true population structure is known, as in the case of simulation of data, there are some useful indices for measuring the recovery of the initial population structure obtained through a clustering strategy. Milligan et al. (1983) recommended using the Corrected Rand and the Jaccard indices. Rand measurement is based on the following ratio: the number of pairs of observations that remain in the same group before the classification (in the simulated SPs) and after classification (in the formed groups), plus those that remain in different groups before and after the classification, over the total number of pairs of observations. Jaccard does not include the number of pairs that remains in different groups before and after.
Optimal Classification
Within the framework of searching for an optimal method for forming core subsets, the best numerical classification is the one that forms compact groups (minimum variance), well differentiated (maximum distances), and with observations that have the highest probability of membership.
Software
The CLUSTAN (Wishart, 1987) software was used for the Ward method with the Gower (1971) distance. The maximum likelihood MLM method using the EM algorithm was implemented in IML SAS (1990) (Franco et al., 1998) considering two cases: HOM and HET variancecovariance matrices across groups.
Data
Simulated Scenarios
The MVN macro described in the Technical Supplement of SAS (SAS Institute, 2001) was used to generate the eight multivariate normal SPs (four with HOM and four with HET variancecovariance matrices). The SP means, variances, and covariances used as input in the MVN macro were obtained using, as a baseline, the means and variancecovariance matrices of two continuous variables from Taba et al. (1998), days to anthesis (V1) and plant height (V2). Values corresponding to one categorical multi-state variable were generated (W).
The means, variances, and correlation coefficients of the two continuous variables (V1 and V2) for each of the eight simulated SPs (four SPs with HET and four SPs with HOM) are shown in Table 1 . Classification using the Ward-MLM strategy was applied to both simulated data sets (HOM and HET) with the MLM assuming HOM and HET among group variancecovariance matrices. Therefore, four cases were studied: HOM-HOM, HOM-HET, HET-HOM, and HET-HET, in which the first abbreviation indicates the type of simulated data considered and the second denotes the assumption made when using the MLM. Further details about the simulated W and continuous variables and the SAS codes for generating these values are given in the Appendix.
|
1 and >1), and ear quality (a visual 0-9 scale transformed to a binary variable:
4.5 and >4.5). | RESULTS |
|---|
|
|
|---|
= 1819.3
for the HET data set. Subpopulations SP2 and SP3 have the same structure for the categorical variable, being more similar than SP1 and SP4 with respect to the means of the continuous variables: They had pair-wise Mahalanobis distances of D223 = 16.4 for the HOM data set and a mean of
= 12.25
for the HET data set. Scatter plots for V1 and V2 of the four simulated SPs with HOM and HET data are shown in Fig. 1a and 1b , respectively. For the HOM data, the shape of the SPs is similar (Fig. 1a). However, for the HET data, the strong association between V1 and V2 in SP4 (r = 0.97, Table 1) determines its more elongated shape (Fig. 1b) as compared with the shape of the observations belonging to SP3, in which the association between V1 and V2 is weaker (r = 0.35, Table 1). The influence of the categorical variable on the underlying SPs when plotted against V1 and V2 is shown for the HET data in the three-dimensional plot of Fig. 1c. Individuals from SP4 have an elongated shape due to the high association between V1 and V2; with respect to categorical variable, 20 individuals have values of W = 1 and five individuals show values of W = 2.
|
|
|
|
Results from the average distances between groups (Mahalanobis distance,
2) for the continuous variables and from the average distance between groups for the categorical variables (
) show that classification using the HET model gives a better numerical separation between groups than the HOM model, regardless of the underlying variancecovariance structure of the data (Table 3)
. However, the best recovery of the true clustering structure, as measured by the Corrected-Rand and Jaccard indexes, was obtained by the HOM model for the HOM data (HOM-HOM) and by the HET model for the HET data (HET-HET) (Table 3). In summary, the Ward-MLM strategy did not completely recover the original SPs, but rearranged the observations in such a way that the across-group homogeneity is decreased with respect to the continuous variables but it increased with respect to the categorical variable. In both cases, the HET model produced well-differentiated groups.
|
Experimental Data
Table 4
shows that the Ward-MLM strategy with the HET assumption gave rise to better clusters than the Ward-MLM with the HOM assumption in (i) eight data sets with a larger average Mahalanobis distances (
2); (ii) six data sets with a larger average distance for the categorical variables (
); (iii) five data sets with a larger average probability of membership
; and (iv) seven data sets with a smaller percentage of observations assigned to a group with P
0.75.
|
|
6 [lower bound should be ni
(p + 1) with p = 5 continuous variables] (Table 5). Under the HOM model, this lower bound is not a problem for the maximum likelihood estimate of the probability of membership, because there is one unique pooled variancecovariance matrix and the lower bound should be n > (p + 1). Note that the singularities obtained in COLOMBIA, GUATEMALA2, MEXICO, URUGUAY, and USA under the HOM model (Table 5) rise when some groups have a size of <6.
It is worthwhile to examine the two cases in which the HOM model produced more separate groups (COLOMBIA and GUATEMALA2) than the HET model (
2 for HET is smaller than
2 for HOM, Table 4). Also, these two data sets showed the lowest values for the chi-square test statistic for the homogeneity of variancecovariance test among groups (Table 5). Table 6
shows the characteristics of the seven clusters obtained under the HOM and HET model on the data set from COLOMBIA. The most important difference can be observed in Group G5. In this group, the HOM model isolated three accessions characterized by their lowest values for all variables, whereas the HET model merged these three accessions with another seven to form a group of 10 accessions with more moderate average values. As previously mentioned, the minimum group size of six imposed by the HET model does not allow the isolation of these three extreme values as does the HOM model. The behavior of the GUATEMALA2 data set under the HOM and HET models is similar (data not shown).
|
| DISCUSSION |
|---|
|
|
|---|
2,
, maximum probability of classification, and number of observations classified with P < 0.75, to decide which clusters should be considered as final clusters. Classification of 10 real data sets shows that the HET model tends to produce more compact and well-separated groups than the HOM model. However, only the HOM model was able to identify a small number of observations with very peculiar characteristics with regard to the attributes. For example, the HOM model isolated a group with three accessions with very low values for most of the variables. These three accessions belong to three maize races that are characterized as being very early with short plant type. The HET model, because of its restriction on cluster size (minimum cluster size should be equal to or larger than the number of continuous variables included in the analyses + 1) does not allow these observations to remain alone, and in this instance, merged them with seven other accessions to form a group of 10.
As expected, the HET model formed clusters with more HET variancecovariances across groups than the HOM model, and chi-square statistics for homogeneity of variancecovariance matrices is larger for clusters formed under the HET model than those formed under the HOM model. This result shows that the HET model preserved the original shape of the underlying SPs more faithfully than the HOM model (although the shape is perhaps more elongated due to the differing trait associations), if, in fact, the data structure has heterogeneity of variancecovariance matrices.
The results of this study show that a recommended strategy for classifying genetic resources would be to apply the Ward-MLM approach with MLM assuming both HOM and HET variancecovariance matrices. Although the HET model seems to form more compact and better separated groups than the HOM model, only the HOM has the advantage of being able to isolate small clusters with accessions with extreme values for some attributes. If few extreme values exist across most traits, the HET model will tend to merge them with other observations. The HET model, however, may suffice in most situations.
For the continuous variables the vectors of means and the variancecovariance matrices for four groups obtained by Taba et al. (1998) for two variables, days to silk (V1) and plant height (V2), were included in the SAS MVN macro for generating four SPs with heterogeneity of variancecovariance. For the case of homogeneity of variancecovariance, the same four vectors of means were used. One variancecovariance matrix was used in which the variance of V1 was the average variance across the four SPs, the variance of V2 is the average variance across the four SPs plus an arbitrary value of 50 and the covariance (V1, V2) was the average of the four SPs. The W values were assigned within each SP, allowing the distribution shown in Table 1.
The SAS codes are as follows:
Heterogeneity of variancecovariance matrices:
%include mvn.sas;
*Store the variancecovariance matrix in four data sets;
data varcov1; input m1-m2; cards;
![]() |
data varcov2; input m1-m2; cards;
![]() |
data varcov3; input m1-m2; cards;
![]() |
data varcov4; input m1-m2; cards;
![]() |
*Store the mean vectors in four data sets;
DATA MEANS1; input m1 @@; cards;
![]() |
DATA MEANS2; input m2 @@; cards;
![]() |
DATA MEANS3; input m3 @@; cards;
![]() |
DATA MEANS4; input m4 @@; cards;
![]() |
![]() |
![]() |
![]() |
![]() |
DATA JF.HET;SET SAMPLE1 SAMPLE2 SAMPLE3 SAMPLE4; RUN;
Homogeneity of variancecovariance matrices:
OPTIONS LS = 132 PS = 5000;
%include mvn.sas;
*Store the pooled variancecovariance matrix in one data set;
data varcova; input m1-m2; cards;
![]() |
*Store the mean vectors in five data sets;
DATA MEANS1; input m1 @@; cards;
![]() |
DATA MEANS2; input m2 @@; cards;
![]() |
DATA MEANS3; input m3 @@; cards;
![]() |
DATA MEANS4; input m4 @@; cards;
![]() |
![]() |
![]() |
![]() |
![]() |
DATA JF.HOM;SET SAMPLE1 SAMPLE2 SAMPLE3 SAMPLE4; RUN;
Received for publication September 5, 2001.
| REFERENCES |
|---|
|
|
|---|
This article has been cited by other articles:
![]() |
J. Franco, J. Crossa, and S. Desphande Hierarchical Multiple-Factor Analysis for Classifying Genotypes Based on Phenotypic and Genetic Data Crop Sci., December 30, 2009; 50(1): 105 - 117. [Abstract] [Full Text] [PDF] |
||||
![]() |
J. Franco, J. Crossa, M. L. Warburton, and S. Taba Sampling Strategies for Conserving Maize Diversity When Forming Core Subsets Using Genetic Markers Crop Sci., February 24, 2006; 46(2): 854 - 864. [Abstract] [Full Text] [PDF] |
||||
![]() |
J. Franco, J. Crossa, S. Taba, and H. Shands A Sampling Strategy for Conserving Genetic Diversity when Forming Core Subsets Crop Sci., May 6, 2005; 45(3): 1035 - 1044. [Abstract] [Full Text] [PDF] |
||||
![]() |
J. Franco and J. Crossa The Modified Location Model for Classifying Genetic Resources: I. Association between Categorical and Continuous Variables Crop Sci., September 1, 2002; 42(5): 1719 - 1726. [Abstract] [Full Text] [PDF] |
||||
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| HOME | HELP | FEEDBACK | SUBSCRIPTIONS | ARCHIVE | SEARCH | TABLE OF CONTENTS |
| The SCI Journals | Agronomy Journal | Vadose Zone Journal | |||
| Journal of Natural Resources and Life Sciences Education |
Soil Science Society of America Journal | ||||
| Journal of Plant Registrations | Journal of Environmental Quality |
The Plant Genome | |||