Crop Science
Crop Science 42:1727-1736 (2002)
© 2002 Crop Science Society of America

PLANT GENETIC RESOURCES

The Modified Location Model for Classifying Genetic Resources

II. Unrestricted Variance–Covariance Matrices

Jorge Franco (a), José Crossa (*, b), Suketoshi Taba (b), and Steve A. Eberhart (c)

a Facultad de Agronomía, Univ. de la República Oriental del Uruguay, Garzón 780, Montevideo, Uruguay
b Maize Genetic Resources Unit, CIMMYT, Apdo. Postal 6-641, 06600 Mexico DF, México
c National Seed Storage Laboratory, USDA-ARS, Fort Collins, CO 80523

* Corresponding author (j.crossa{at}cgiar.org)


    ABSTRACT
 
When evaluating genetic resources and forming core subsets, gene bank accessions are classified into homogeneous and well-separated groups. The modified location model (MLM) is used in the context of a two-stage clustering strategy in which initial groups are first defined using a hierarchical clustering method (such as Ward) and then the MLM is applied to the groups that are formed (Ward-MLM). The MLM allows assuming correlations (between attributes) and variances (of the attributes) among subpopulations (SPs) to be equal (homogeneous, HOM) or different (heterogeneous, HET). The objectives of this study were (i) to compare the effect of assuming homogeneity with the effect of assuming heterogeneity of variance–covariance matrices on the classification of two simulated data sets using the Ward-MLM strategy; and (ii) to make the same type of comparison using data from maize (Zea mays L.) accessions from nine countries. When simulated HOM data were analyzed with the HOM model and the simulated HET data were analyzed with the HET model, some of the original SPs were represented in the resulting clusters but others changed and formed more separated groups. The HET model always formed the most compact and separated clusters, even for HOM data. Classification of 10 real data sets showed that the HET model produced more compact and well-separated groups than the HOM model. However, only the HOM model identified and grouped a small number of observations with very peculiar attributes. Although the HET model may suffice in most situations, the recommended strategy when classifying genetic resources would be to use both models.

Abbreviations: EM, expectation maximization • GM, Gaussian model • HET, heterogeneous • HOM, homogeneous • LM, location model • MLM, modified location model • SP, subpopulation


    INTRODUCTION
 
WHEN EVALUATING AND STUDYING genetic resources and their diversity and when forming core subsets, multivariate data on continuous and categorical attributes of gene bank accessions are collected (Brown, 1989; Crossa et al., 1995). Individual accessions can be conceptualized as being located in a multidimensional space in which there is one dimension for each variable. The shape and structure of the groups of accessions in this multidimensional space are unknown, but the associations between attributes influence the shape of the groups, and the structure is affected by the true composition of the groups. Hierarchical, nonhierarchical, and statistical classification methods attempt to recover, as much as possible, the true shape and structure of the underlying groups.

In genetic resources conservation and the formation of core subsets, the main objective is to select accessions that best represent the entire collection with the minimum loss of genetic diversity. Therefore, the best numerical classification strategy is the one that produces the most compact and well separated groups (i.e., minimum variability within each group and maximum variability among groups).

The location model (LM) proposed by Lawrence and Krzanowski (1996) classifies individuals into g HOM groups using categorical and continuous variables. The model combines the levels of the discrete variables in one unique multinomial variable, W, with m values. Franco et al. (1998) proposed the MLM in the context of a two-stage clustering strategy in which initial groups are first defined using a hierarchical clustering method [such as Ward or Unweighted Pair Grouping with Arithmetic Means (UPGMA)] and then the MLM is applied with the purpose of improving those groups.

The MLM model makes two important assumptions. First, the multinomial variable, W, is independent from the vector of continuous variables. Second, the underlying SPs can have restricted HOM or unrestricted HET variance–covariance matrices. In other words, the correlations (between attributes) and variances (of the attributes) can be assumed to be equal (HOM) or to be different (HET) across SPs. The accompanying article of Franco and Crossa (2002) demonstrated, using different simulated scenarios, that the MLM is very robust when the independence between the variable W and the vector of continuous variables does not hold even when high overlapping occurs between SPs on the values of the W variable and on the continuous variables. Franco and Crossa (2002) showed that the true number of SPs is accurately estimated by the Ward-MLM strategy and that the SPs are fully recovered in most cases, even when strong dependence holds.

The assumption of heterogeneity or homogeneity of variance–covariance matrices across SPs affects the number of parameters to be estimated, and it may be a limitation when a large number of attributes are measured. For p continuous attributes measured on each individual, the number of parameters to be estimated is g(1 + m + p) + p(p + 1)/2 under the homogeneity assumption and g[1 + m + p + p(p + 1)/2] under the heterogeneity assumption (for g SP proportions, gm multinomial cell proportions, gp SP means, and p(p + 1)/2 and gp(p + 1)/2 variances and covariances, respectively). In other words, the MLM under heterogeneity of variance–covariance matrices estimates (g - 1)[p(p + 1)/2] more parameters than under the homogeneity assumption; when this number is large relative to the number of observations, the heterogeneity model cannot be used.
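The two parameter counts can be checked with a few lines of arithmetic. The following is a minimal sketch; the function name and the example values of g, m, and p are illustrative assumptions, not taken from the study.

```python
def mlm_parameter_counts(g, m, p):
    """Return (HOM, HET) parameter counts for the MLM.

    g: number of subpopulations, m: number of multinomial cells,
    p: number of continuous attributes (formulas as given in the text).
    """
    cov_terms = p * (p + 1) // 2          # variances and covariances of one matrix
    hom = g * (1 + m + p) + cov_terms     # one pooled matrix
    het = g * (1 + m + p + cov_terms)     # one matrix per subpopulation
    return hom, het


# Illustrative values only: 4 subpopulations, 3 cells, 2 continuous attributes.
hom, het = mlm_parameter_counts(g=4, m=3, p=2)
print(hom, het, het - hom)  # the difference equals (g - 1) * p * (p + 1) / 2
```

With five continuous attributes, each additional group adds 15 variance and covariance terms under heterogeneity, which is why the HET model becomes infeasible for small data sets.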

In practical situations, however, it seems realistic to assume that covariances between some pairs of attributes and variances of attributes change, depending on the subset of accessions (i.e., SPs) under consideration. For example, subsets of accessions with differential tolerance/susceptibility to a specific disease might show varying associations between that disease and grain yield. Jorgensen and Hunt (1996) and McLachlan and Basford (1988) pointed out some difficulties with highly parameterized models: the likelihood function of a mixture model can present several singularities, and many local maxima can be found.

As with testing the effect of the dependence between the W variable and the vector of continuous variables (Franco and Crossa, 2002), the effect of assuming homogeneity or heterogeneity in the MLM when classifying a data set whose SPs have HOM or HET variance–covariance matrices can only be tested using simulated experimental scenarios with a known structure. In real data sets, the structure of the underlying SPs is unknown, as are the variances of each attribute and the associations between attributes across SPs. In this case (and only if the ratio of the number of observations to the number of estimated parameters allows the use of the HET model), the researcher can compare the resulting clusters under the homogeneity and heterogeneity models and then select the best classification based on appropriate numerical and statistical criteria such as the maximum distance between groups, the minimum variance within groups, and the probability of membership of each observation into each group.

The objectives of this study were (i) to compare the effect of using the Ward-MLM strategy assuming among-group homogeneity with the effect of assuming among-group heterogeneity of variance–covariance matrices, on the classification of two simulated data sets with known structures and either HOM or HET variance–covariance matrices; and (ii) to compare the classifications obtained assuming homogeneity with those obtained assuming heterogeneity of variance–covariance matrices across SPs using data from maize accessions from nine countries (Taba et al., 1999).


    MATERIALS AND METHODS
 
Details of the Gaussian model (GM) (Day, 1969; Wolfe, 1970; McLachlan and Basford, 1988), the LM (Lawrence and Krzanowski, 1996), and the MLM (Franco et al., 1998) and their maximum likelihood estimates obtained using the expectation maximization (EM) algorithm (Dempster et al., 1977) are given in Franco et al. (1998).

The Modified Location Model
Assume a random sample of n individuals from a mixture of a certain number of unknown SPs. Let the n vectors, each of size p + q, be the observations of p continuous and q categorical variables on each of the n individuals. The MLM combines the levels of each of the q categorical variables into one multinomial variable, W, with values $w_s$ = 1, 2, ..., m (m being the maximum number of possible levels or multinomial cells). Each observation can be written as a 1 × (p + 1) vector $x'_{sj} = (w_s, y'_{sj}) = (w_s, y_{sj1}, \ldots, y_{sjp})$, where for j = 1, ..., n observations, $y_{sjk}$ (k = 1, ..., p) is one of the p continuous variables, and $w_s$ is the multinomial value or multinomial cell. Note that the extra index s is used to point out that the LM is conditioned on the multinomial cell s.

The likelihood function corresponding to the matrix of the entire sample data $X_{n \times (p+1)}$ is

$$L(\Theta \mid X) = \prod_{s=1}^{m} \prod_{j=1}^{n_s} \sum_{i=1}^{g} \alpha_i \, p_{is} \, (2\pi)^{-p/2} \, |\Sigma|^{-1/2} \exp\left\{ -\tfrac{1}{2} (y_{sj} - \mu_i)' \, \Sigma^{-1} (y_{sj} - \mu_i) \right\}$$

where $y_{sj}$ is the vector of the continuous variables of the jth observation in the sth multinomial cell; $n_s$ is the number of observations in the sth multinomial cell; $\Theta$ contains the parameters of the model; $\alpha_i$ (i = 1, ..., g) is the proportion of observations in each SP (cluster) of the mixture; $p_{is}$ is the proportion of observations in the sth multinomial cell of the ith SP; $\Sigma$ is the common variance–covariance matrix across SPs; and $\mu_i$ is the vector of means of the ith SP. The X matrix is considered incomplete in the sense that the true original SP to which each observation belongs is unknown. Therefore, a solution is obtained using the EM algorithm (Dempster et al., 1977). The matrix is completed when each observation is assigned to an SP. Then, the log likelihood for the complete data matrix is obtained using the initial Ward groups as the starting point for the EM algorithm, and the maximum likelihood estimates of the parameters are found. The groups thus formed maximize the log likelihood of the observations, and the probability of membership of each observation in each group is obtained.

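Assuming heterogeneity of variance–covariance matrices across groups, the maximum likelihood estimate of the variance–covariance matrix of the ith SP is

$$\hat{\Sigma}_i = \frac{1}{n \hat{\alpha}_i} \sum_{s=1}^{m} \sum_{j} \hat{\tau}_{isj} \, (y_{sj} - \hat{\mu}_i)(y_{sj} - \hat{\mu}_i)'$$

where $\hat{\tau}_{isj}$ is the probability of membership of each observation in the ith SP and the sums over s and j run over all n observations. Assuming homogeneity of variance–covariance matrices across groups, its maximum likelihood estimate is given by

$$\hat{\Sigma} = \frac{1}{n} \sum_{i=1}^{g} \sum_{s=1}^{m} \sum_{j} \hat{\tau}_{isj} \, (y_{sj} - \hat{\mu}_i)(y_{sj} - \hat{\mu}_i)'$$

The probability of membership for each observation belonging to the ith SP assuming heterogeneity of variance–covariance is estimated as

$$\hat{\tau}_{isj} = \frac{\hat{\alpha}_i \, \hat{p}_{is} \, \phi(y_{sj}; \hat{\mu}_i, \hat{\Sigma}_i)}{\sum_{h=1}^{g} \hat{\alpha}_h \, \hat{p}_{hs} \, \phi(y_{sj}; \hat{\mu}_h, \hat{\Sigma}_h)}$$

where $\phi(y; \mu, \Sigma)$ denotes the p-variate normal density with mean vector $\mu$ and variance–covariance matrix $\Sigma$. When homogeneity of variances–covariances is assumed, $\hat{\Sigma}_i$ should be replaced by $\hat{\Sigma}$.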
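One EM iteration for the MLM, under either assumption, can be sketched as follows. This is an illustrative NumPy sketch based on the standard EM updates for a normal mixture with an independent multinomial variable, not the authors' SAS/IML implementation; the function names, the argument layout, and the zero-based cell codes are assumptions.

```python
import numpy as np


def normal_pdf(y, mean, cov):
    """p-variate normal density evaluated at each row of y."""
    p = y.shape[1]
    diff = y - mean
    quad = np.einsum("np,pq,nq->n", diff, np.linalg.inv(cov), diff)
    norm = np.sqrt((2 * np.pi) ** p * np.linalg.det(cov))
    return np.exp(-0.5 * quad) / norm


def em_step(y, w, alpha, mu, pcell, covs, heterogeneous):
    """One EM iteration.

    y: (n, p) continuous data; w: (n,) cell codes 0..m-1;
    alpha: (g,) SP proportions; mu: (g, p) SP means;
    pcell: (g, m) cell proportions per SP; covs: (g, p, p) matrices.
    """
    n, p = y.shape
    g = len(alpha)
    # E-step: posterior probability of membership of each observation.
    tau = np.column_stack(
        [alpha[i] * pcell[i, w] * normal_pdf(y, mu[i], covs[i]) for i in range(g)]
    )
    tau /= tau.sum(axis=1, keepdims=True)
    # M-step: weighted proportions, means, and cell proportions.
    nk = tau.sum(axis=0)
    alpha_new = nk / n
    mu_new = (tau.T @ y) / nk[:, None]
    onehot = np.eye(pcell.shape[1])[w]
    pcell_new = (tau.T @ onehot) / nk[:, None]
    # Weighted scatter matrix of each SP around its new mean.
    scatter = np.empty((g, p, p))
    for i in range(g):
        d = y - mu_new[i]
        scatter[i] = (tau[:, i, None] * d).T @ d
    if heterogeneous:
        covs_new = scatter / nk[:, None, None]      # one matrix per SP
    else:
        pooled = scatter.sum(axis=0) / n            # single pooled matrix
        covs_new = np.repeat(pooled[None], g, axis=0)
    return tau, alpha_new, mu_new, pcell_new, covs_new
```

In the two-stage strategy, the starting values of `alpha`, `mu`, `pcell`, and `covs` would be computed from the initial Ward groups, and the step would be repeated until the log likelihood stops increasing.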

The heterogeneity assumption implies an increase in the number of parameters to be estimated. This is a limitation of the model. Also, when the size of the cluster $n_i = \alpha_i n$ is lower than p + 1 (p being the number of continuous variables), the within-cluster variance–covariance matrix is singular and thus the maximum likelihood estimators of the probability of membership cannot be computed. This imposes a lower bound on the cluster size of $n_i \ge p + 1$. On the other hand, when the variance–covariance matrices are assumed to be HOM, a unique pooled variance–covariance matrix is used, and therefore the required lower bound for having a nonsingular variance–covariance matrix is $n \ge p + 1$ which, in general, does not pose any problem for estimating the parameters.
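The lower bound on cluster size can be illustrated numerically. The sketch below uses random normal data (an assumption made purely for illustration) and checks the rank of the maximum likelihood covariance estimate for a cluster with p = 5 continuous variables.

```python
import numpy as np

rng = np.random.default_rng(0)
p = 5  # number of continuous variables, as in the experimental data sets


def cov_rank(n_i):
    """Rank of the ML covariance estimate of a cluster with n_i observations."""
    y = rng.normal(size=(n_i, p))
    cov = np.cov(y, rowvar=False, bias=True)  # bias=True divides by n_i
    return np.linalg.matrix_rank(cov)


for n_i in (3, 5, 6, 10):
    print(n_i, cov_rank(n_i))  # rank is below p whenever n_i < p + 1
```

Because centering the data removes one degree of freedom, the rank cannot exceed n_i - 1, so the matrix can only be inverted once n_i reaches p + 1.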

Two-stage Ward-Modified Location Model Method
The two-stage Ward-MLM strategy was proposed by Franco et al. (1998). The underlying idea is that the initial groups formed by the Ward method, used as the starting point by the MLM model, will allow a better approximation to some maximum (global or local) because this starting point has the property implied in the Ward method; that is, to minimize the within-group sum of squares.

As pointed out by Franco et al. (1998), when the total number of cells m x g is very large, the Ward grouping will not be improved by the MLM because the observations are spread out widely across the cells, and the final classification will be the same as the Ward classification.

Estimation of the Optimal Number of Subpopulations
The optimal number of SPs is determined using the upper tail approach (Wishart, 1987) combined with the likelihood ratio test (Mardia et al., 1979), and the log-likelihood profile (Franco et al., 1998). For a specific simulated scenario, in order to make results comparable, the same number of initial groups was used for the Ward-MLM strategy, assuming homogeneity and heterogeneity of variance–covariance matrices.

Measuring Distances Between Groups for the Continuous Variables
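For comparing the results under homogeneity and heterogeneity of variance–covariance matrices, the average, $\bar{D}^2$, of the Mahalanobis (1930) distances, $D^2_{ij}$, between each pair of groups (i, j) was used. When the HOM variance–covariance matrix is assumed, the Mahalanobis distance is $D^2_{ij} = D^2_{ji} = (\hat{\mu}_i - \hat{\mu}_j)' \hat{\Sigma}^{-1} (\hat{\mu}_i - \hat{\mu}_j)$, where $\hat{\mu}_i$ is the estimated mean vector of group i. When HET variance–covariance matrices are assumed, $D^2_{ij} = (\hat{\mu}_i - \hat{\mu}_j)' \hat{\Sigma}_i^{-1} (\hat{\mu}_i - \hat{\mu}_j)$, $D^2_{ji} = (\hat{\mu}_i - \hat{\mu}_j)' \hat{\Sigma}_j^{-1} (\hat{\mu}_i - \hat{\mu}_j)$, and $D^2_{ij} \ne D^2_{ji}$. However, the average of the average distances between groups based on $D^2_{ij}$ is equal to the average of the average distances between groups based on $D^2_{ji}$, and both are equal to $\bar{D}^2$. Thus $\bar{D}^2$ estimated assuming heterogeneity can be compared directly with $\bar{D}^2$ estimated assuming homogeneity.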
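The averaging over all ordered pairs of groups can be sketched as below. This is an illustrative NumPy sketch; the function name and the small example values are assumptions. Passing the same matrix for every group gives the HOM case.

```python
import numpy as np


def average_mahalanobis(means, covs):
    """Average squared Mahalanobis distance over all ordered pairs (i, j), i != j.

    D2_ij uses the covariance matrix of group i, so D2_ij and D2_ji differ
    under heterogeneity; averaging over ordered pairs uses both.
    """
    g = len(means)
    total = 0.0
    for i in range(g):
        inv_i = np.linalg.inv(covs[i])
        for j in range(g):
            if i != j:
                d = means[i] - means[j]
                total += d @ inv_i @ d
    return total / (g * (g - 1))


means = np.array([[0.0, 0.0], [3.0, 4.0]])
hom = average_mahalanobis(means, [np.eye(2), np.eye(2)])
het = average_mahalanobis(means, [np.eye(2), 4.0 * np.eye(2)])
print(hom, het)
```

In the heterogeneous call, the two directed distances differ by a factor of four, and the reported value is their mean.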

Measuring Distances Between Groups for the Multinomial Variable
Krzanowski (1983) proposed measurements of affinity and distance between groups for the LM. For the categorical variables, the author uses $\sum_{s=1}^{m} (p_{is} p_{js})^{1/2}$ as a measure of affinity between any pair of groups (i, j), where $p_{is}$ and $p_{js}$ denote the proportion of cases with the sth value (s = 1, ..., m, the multinomial cell) in the i and j groups, respectively. We used the average, $\bar{d}$, of the distances $d_{ij} = 1 - \sum_{s=1}^{m} (p_{is} p_{js})^{1/2}$ between every pair of groups as a measure of distance corresponding to the discrete part of the model.
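The distance for the discrete part of the model is simple to compute from the cell proportions of each group. The sketch below is illustrative; the function names and example proportions are assumptions.

```python
from itertools import combinations
from math import sqrt


def cell_distance(p_i, p_j):
    """One minus the affinity between two vectors of multinomial cell proportions."""
    return 1.0 - sum(sqrt(a * b) for a, b in zip(p_i, p_j))


def average_cell_distance(cell_props):
    """Average of the pairwise distances over every pair of groups."""
    pairs = list(combinations(range(len(cell_props)), 2))
    return sum(cell_distance(cell_props[i], cell_props[j]) for i, j in pairs) / len(pairs)


# Hypothetical proportions for three groups over m = 3 cells.
props = [[0.8, 0.2, 0.0], [0.0, 0.2, 0.8], [0.8, 0.2, 0.0]]
print(average_cell_distance(props))
```

Groups with identical cell proportions are at distance 0, and groups that share no cells are at distance 1.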

Measuring the Quality of a Classification
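For the GM, McLachlan and Basford (1988) proposed, as a measure of the quality of a classification, the average of the maximum of the probabilities of membership,

$$\bar{P} = \frac{1}{n} \sum_{j=1}^{n} \max_{i} \hat{\tau}_{ij}$$

where $\hat{\tau}_{ij}$ is the estimated probability that observation j belongs to group i.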

Because an observation is assigned to a group with maximum probability of membership, it is expected that well-classified individuals will have a high probability of membership.
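This quality measure, together with the percentage of observations assigned with a low maximum probability, can be computed directly from the matrix of membership probabilities. The sketch below is illustrative; the function name and the example probabilities are assumptions, and the 0.75 threshold is the one reported for the experimental data.

```python
def membership_quality(tau, threshold=0.75):
    """Return (average maximum probability, percent of observations at or below threshold).

    tau: list of rows, one per observation, each holding the probabilities
    of membership in every group.
    """
    max_probs = [max(row) for row in tau]
    n = len(max_probs)
    average = sum(max_probs) / n
    pct_low = 100.0 * sum(1 for value in max_probs if value <= threshold) / n
    return average, pct_low


# Hypothetical membership probabilities for four observations and two groups.
tau = [[0.9, 0.1], [0.6, 0.4], [0.2, 0.8], [0.75, 0.25]]
print(membership_quality(tau))
```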

Measuring the Recovery of the True Clustering Structure
When the true population structure is known, as in the case of simulated data, there are some useful indices for measuring how well a clustering strategy recovers the initial population structure. Milligan et al. (1983) recommended using the Corrected Rand and the Jaccard indices. The Rand index is based on the following ratio: the number of pairs of observations that are in the same group before the classification (in the simulated SPs) and after classification (in the formed groups), plus those that are in different groups before and after the classification, over the total number of pairs of observations. The Jaccard index does not include the number of pairs that are in different groups before and after.
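Both indices follow from counting pairs of observations. The sketch below is illustrative; the function names are assumptions, and the corrected version shown is the Hubert and Arabie chance-corrected form, which is assumed here to be what "Corrected Rand" refers to.

```python
from itertools import combinations


def pair_counts(true_labels, found_labels):
    """Count pairs: a = together in both, b = together only in the true SPs,
    c = together only in the formed groups, d = separated in both."""
    a = b = c = d = 0
    for i, j in combinations(range(len(true_labels)), 2):
        same_true = true_labels[i] == true_labels[j]
        same_found = found_labels[i] == found_labels[j]
        if same_true and same_found:
            a += 1
        elif same_true:
            b += 1
        elif same_found:
            c += 1
        else:
            d += 1
    return a, b, c, d


def rand_index(true_labels, found_labels):
    a, b, c, d = pair_counts(true_labels, found_labels)
    return (a + d) / (a + b + c + d)


def jaccard_index(true_labels, found_labels):
    a, b, c, _ = pair_counts(true_labels, found_labels)
    return a / (a + b + c)


def corrected_rand_index(true_labels, found_labels):
    # Assumed chance-corrected form; undefined when the denominator is zero.
    a, b, c, d = pair_counts(true_labels, found_labels)
    return 2 * (a * d - b * c) / ((a + b) * (b + d) + (a + c) * (c + d))


sps = [1, 1, 2, 2]
groups = [1, 1, 2, 2]
print(rand_index(sps, groups), jaccard_index(sps, groups), corrected_rand_index(sps, groups))
```

The pair loop is quadratic in the number of observations, which is acceptable for data sets of the size considered here.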

Optimal Classification
Within the framework of searching for an optimal method for forming core subsets, the best numerical classification is the one that forms compact groups (minimum variance), well differentiated (maximum distances), and with observations that have the highest probability of membership.

Software
The CLUSTAN (Wishart, 1987) software was used for the Ward method with the Gower (1971) distance. The maximum likelihood MLM method using the EM algorithm was implemented in IML SAS (1990) (Franco et al., 1998) considering two cases: HOM and HET variance–covariance matrices across groups.

Data
Simulated Scenarios
The MVN macro described in the Technical Supplement of SAS (SAS Institute, 2001) was used to generate the eight multivariate normal SPs (four with HOM and four with HET variance–covariance matrices). The SP means, variances, and covariances used as input in the MVN macro were obtained using, as a baseline, the means and variance–covariance matrices of two continuous variables from Taba et al. (1998), days to anthesis (V1) and plant height (V2). Values corresponding to one categorical multi-state variable were generated (W).
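A comparable simulation of one SP can be sketched in NumPy rather than with the SAS MVN macro. The function name and the mean and variance values below are hypothetical placeholders (the actual inputs are those of Table 1); only the correlation of 0.97 and the cell counts of 20 and 5 are the values reported in the text for SP4 under heterogeneity, and the third cell is assumed to be empty.

```python
import numpy as np


def simulate_sp(rng, n, mean, variances, r, w_counts):
    """Draw n bivariate normal observations (V1, V2) and assign levels of W.

    variances: variances of V1 and V2; r: correlation between V1 and V2;
    w_counts: number of observations at each level of W (must sum to n).
    """
    if sum(w_counts) != n:
        raise ValueError("w_counts must sum to n")
    sd1, sd2 = np.sqrt(variances)
    cov = np.array([[variances[0], r * sd1 * sd2],
                    [r * sd1 * sd2, variances[1]]])
    y = rng.multivariate_normal(mean, cov, size=n)
    w = np.repeat(np.arange(1, len(w_counts) + 1), w_counts)  # levels 1..m
    return y, w


rng = np.random.default_rng(42)
# Placeholder mean and variances; r and W counts as reported for SP4.
y, w = simulate_sp(rng, n=25, mean=(90.0, 200.0), variances=(16.0, 400.0),
                   r=0.97, w_counts=(20, 5, 0))
print(y.shape, np.bincount(w))
```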

The means, variances, and correlation coefficients of the two continuous variables (V1 and V2) for each of the eight simulated SPs (four SPs with HET and four SPs with HOM) are shown in Table 1. Classification using the Ward-MLM strategy was applied to both simulated data sets (HOM and HET) with the MLM assuming HOM and HET among-group variance–covariance matrices. Therefore, four cases were studied: HOM-HOM, HOM-HET, HET-HOM, and HET-HET, in which the first abbreviation indicates the type of simulated data considered and the second denotes the assumption made when using the MLM. Further details about the simulated W and continuous variables and the SAS codes for generating these values are given in the Appendix.


 
Table 1. Means, variances, and correlation coefficients (r) of two continuous variables, days to silk (V1), and plant height (V2) of simulated data. Number of observations for each level of the multinomial variable W for four subpopulations (SP) of size (ni) with heterogeneity (HET) and homogeneity (HOM) of variance–covariance.

 
Experimental Data
Ten experimental data sets (from nine different countries) were obtained from the Latin American Maize Project (Taba et al., 1999). The experiments included a wide range of maize accessions. The data sets are named based on their country of origin: BOLIVIA, BRAZIL, CHILE, COLOMBIA, GUATEMALA1, GUATEMALA2, MEXICO, URUGUAY, USA, and VENEZUELA. In each data set, nine attributes (five continuous and four categorical) were included for classification: days to anthesis and silking, plant and ear height, percentage of grain moisture at harvest, kernel color and type, number of ears per plant (<=1 and >1), and ear quality (a visual 0-9 scale transformed to a binary variable: <=4.5 and >4.5).


    RESULTS
 
Simulated Data
Characteristics of the Simulated SPs With HOM and HET Variance–Covariance Matrices
The means of V1 and V2 and structures of W associated with SP1 through SP4 do not change between the HET and HOM cases. As expected, for the HOM case, the variances of V1 and V2 are the same across SP1 through SP4 (Table 1). For the HOM and HET cases, SP1 and SP4 have the same structure with respect to the categorical variable W, but differ widely with respect to the means of the two continuous variables (Table 1). However, due to differences in variances between SPs for the two cases, SP1 and SP4 have pair-wise Mahalanobis distances of $D^2_{14}$ = 138.9 for the HOM data set and a mean of $\bar{D}^2_{14}$ = 1819.3 for the HET data set. Subpopulations SP2 and SP3 have the same structure for the categorical variable, being more similar than SP1 and SP4 with respect to the means of the continuous variables: They had pair-wise Mahalanobis distances of $D^2_{23}$ = 16.4 for the HOM data set and a mean of $\bar{D}^2_{23}$ = 12.25 for the HET data set.

Scatter plots for V1 and V2 of the four simulated SPs with HOM and HET data are shown in Fig. 1a and 1b , respectively. For the HOM data, the shape of the SPs is similar (Fig. 1a). However, for the HET data, the strong association between V1 and V2 in SP4 (r = 0.97, Table 1) determines its more elongated shape (Fig. 1b) as compared with the shape of the observations belonging to SP3, in which the association between V1 and V2 is weaker (r = 0.35, Table 1). The influence of the categorical variable on the underlying SPs when plotted against V1 and V2 is shown for the HET data in the three-dimensional plot of Fig. 1c. Individuals from SP4 have an elongated shape due to the high association between V1 and V2; with respect to categorical variable, 20 individuals have values of W = 1 and five individuals show values of W = 2.



 
Fig. 1. Plot of the simulated observations of four subpopulations with (a) homogeneity and (b) heterogeneity of variance–covariance for days to silk (V1) and plant height (V2); the symbols square, asterisk, diamond, and ball correspond to subpopulations SP1, SP2, SP3, and SP4, respectively. Three-dimensional representation (c) of the simulated observations of four subpopulations with heterogeneity of variance–covariance for V1, V2, and the categorical variable (W); the symbols cube, cylinder, pyramid, and circle correspond to subpopulations SP1, SP2, SP3, and SP4, respectively.

 
Classification of the Simulated Data using Ward-MLM Strategy with MLM assuming HOM and HET of Variance–Covariance Matrices
The log-likelihood profiles, used along with the upper tail approach for determining the optimal number of groups obtained by the Ward-MLM strategy, showed four groups (the largest increases in the log-likelihood function occurred from three to four groups) for the HOM data when analyzed with the HOM and HET models (i.e., HOM-HOM and HOM-HET) (Fig. 2a). Values of the log-likelihood are always larger for the HET model than for the HOM model. Similar results were obtained for the HET data when analyzed with the HOM and HET models (Fig. 2b). These results show the success of the method in accurately estimating the number of true SPs.



 
Fig. 2. Plot of the log-likelihood profile of the simulated data with homogeneity of variance–covariance matrices for the number of groups formed under the modified location model assuming (a) homogeneity (HOM-HOM) and heterogeneity (HOM-HET) and (b) homogeneity (HET-HOM) and heterogeneity (HET-HET) of variance–covariance matrices. Squares represent homogeneity, circles represent heterogeneity.

 
Table 2 presents the characteristics of the groups formed under the four cases. For the HOM-HOM and HET-HET cases, and in terms of the continuous and categorical variables, the structure and the shape (correlation coefficient) of the original SP1 and SP4 were well recovered in Groups G1 and G4 (Tables 1 and 2). On the other hand, the structure and the shape of SP2 and SP3 were not well recovered in G2 and G3, respectively. The variances and correlations of V1 and V2 in G2 and G3 are greater than in SP2 and SP3, respectively, for HET-HET and HOM-HOM. Therefore, there is a decrease in homogeneity with respect to the continuous variables, and an increase in homogeneity with respect to the categorical W variable (Tables 1 and 2).


 
Table 2. Means, variances, and within group correlation coefficient (r) of two continuous variables, days to silk (V1), and plant height (V2); and number of observations for each level of the categorical variable (W) for four groups (G) of size (ni) obtained applying the Ward-MLM strategy assuming heterogeneity (HET) and homogeneity (HOM) of variance-covariance on the simulated data with heterogeneity (HET) and homogeneity (HOM) of variance-covariance structure.

 
The impact of the categorical variable on the clusters formed under HET-HET is depicted in Fig. 1c and 3. Groups G1 and G4 completely recovered SP1 and SP4. Group G2, with 10 individuals, comprises five individuals from SP2 with W = 2 and five individuals from SP3 with W = 2. The 20 individuals from SP2 (with W = 3) merged with 20 individuals from SP3 (with W = 3) and formed Group G3.



 
Fig. 3. Three-dimensional representation of the observations of four groups obtained by the modified location model assuming heterogeneity of variance–covariance matrices in simulated data with HET variance–covariance matrices for days to silk (V1), plant height (V2), and the multinomial variable (W). The symbols cube, cylinder, pyramid, and circle correspond to Groups G1, G2, G3, and G4, respectively.

 
As expected, the HOM model applied to HET data (HET-HOM) tends to homogenize the variances and covariance of V1 and V2 across groups, whereas the HET model, used to classify HOM data (HOM-HET), tends to differentiate the variances and covariance. However, due to the changes in the groups caused by the effect of the categorical variable W, the shapes (correlation between attributes) of the resulting groups do not show a clear pattern compared with the shapes of the SPs. For HET-HOM, two groups had smaller correlations and two had greater correlations between the attributes than those obtained in the original SPs (Tables 1 and 2). For the case HOM-HET, correlations for the four groups increase compared with correlations for the original SPs. Most of the attribute variances of the groups increase compared with the attribute variances in the original SPs, and variances of groups obtained with the HET model tend to be greater than those obtained with the HOM model.

Results from the average distances between groups (Mahalanobis distance, $\bar{D}^2$) for the continuous variables and from the average distance between groups for the categorical variables ($\bar{d}$) show that classification using the HET model gives a better numerical separation between groups than the HOM model, regardless of the underlying variance–covariance structure of the data (Table 3). However, the best recovery of the true clustering structure, as measured by the Corrected-Rand and Jaccard indexes, was obtained by the HOM model for the HOM data (HOM-HOM) and by the HET model for the HET data (HET-HET) (Table 3). In summary, the Ward-MLM strategy did not completely recover the original SPs, but rearranged the observations in such a way that the across-group homogeneity decreased with respect to the continuous variables but increased with respect to the categorical variable. In both cases, the HET model produced well-differentiated groups.


 
Table 3. Average Mahalanobis distances between groups for the continuous variables ($\bar{D}^2$) and average distances between groups for the categorical variable ($\bar{d}$) under the homogeneity (HOM) and the heterogeneity (HET) models for simulated data with HET and HOM of variance–covariance. External measures of recovery of the true cluster structure: Corrected-Rand (C-Rand) and Jaccard indexes.

 
In general, results of the simulation indicate that HET-HET and HOM-HOM recovered the composition of some original SPs and that the HET model formed more differentiated groups with both HOM and HET data.

Experimental Data
Table 4 shows that the Ward-MLM strategy with the HET assumption gave rise to better clusters than the Ward-MLM with the HOM assumption in (i) eight data sets with a larger average Mahalanobis distance ($\bar{D}^2$); (ii) six data sets with a larger average distance for the categorical variables ($\bar{d}$); (iii) five data sets with a larger average probability of membership ($\bar{P}$); and (iv) seven data sets with a smaller percentage of observations assigned to a group with $P \le 0.75$.


 
Table 4. Data set identified by country of origin, average Mahalanobis distances among groups for the continuous variables ($\bar{D}^2$), average distance among groups for the categorical variables ($\bar{d}$), average of the maximum probability of membership ($\bar{P}$), and percentage of observations assigned with probability of membership $P \le 0.75$ under the homogeneous (HOM) and heterogeneous (HET) models.

 
Further examination of the clusters formed by the HET and HOM models applied to the 10 data sets shows that the HET model produces higher chi-square values for rejecting the null hypothesis that the group variance–covariances are HOM (Mardia et al., 1979) and higher minimum group size than the HOM model (Table 5) . All chi-square values for HET models are higher than those for the HOM model, except for GUATEMALA2. For COLOMBIA, the increase in the chi-square value of the HET model as compared with the HOM model is small in relation to the rest of the data sets (except GUATEMALA2).


 
Table 5. Number of observations (n) and groups (g), minimum group size (min), number of groups with singular variance–covariance matrix (sing), and chi-square statistic for the homogeneity of variance–covariance matrices ($\chi^2$) for data sets identified by country of origin and analyzed under the homogeneous (HOM) and heterogeneous (HET) models.

 
The HET model does not produce any singular variance–covariance matrix within groups because all groups formed under the HET model have sizes $\ge 6$ [the lower bound is $n_i \ge p + 1$ with p = 5 continuous variables] (Table 5). Under the HOM model, this lower bound is not a problem for the maximum likelihood estimate of the probability of membership, because there is one unique pooled variance–covariance matrix and the lower bound is $n \ge p + 1$. Note that the singularities obtained in COLOMBIA, GUATEMALA2, MEXICO, URUGUAY, and USA under the HOM model (Table 5) arise when some groups have a size of less than 6.

It is worthwhile to examine the two cases in which the HOM model produced more separate groups (COLOMBIA and GUATEMALA2) than the HET model ($\bar{D}^2$ for HET is smaller than $\bar{D}^2$ for HOM, Table 4). Also, these two data sets showed the lowest values for the chi-square test statistic for the homogeneity of variance–covariance test among groups (Table 5). Table 6 shows the characteristics of the seven clusters obtained under the HOM and HET models on the data set from COLOMBIA. The most important difference can be observed in Group G5. In this group, the HOM model isolated three accessions characterized by their lowest values for all variables, whereas the HET model merged these three accessions with another seven to form a group of 10 accessions with more moderate average values. As previously mentioned, the minimum group size of six imposed by the HET model does not allow the isolation of these three extreme values as does the HOM model. The behavior of the GUATEMALA2 data set under the HOM and HET models is similar (data not shown).


Table 6. Description of the groups (G) by days to anthesis (ANTH), days to silk (SILK), plant height (PLHT), ear height (EAHT), moisture (MOIS), and group size (ni) for the data set from COLOMBIA under the homogeneous (HOM) and heterogeneous (HET) models.

 

    DISCUSSION
 
Results from the simulated HOM and HET data using the Ward-MLM strategy with the MLM assuming HOM and HET variance–covariance matrices indicate that in HOM-HOM and HET-HET most of the original SPs are retained in the resulting clusters. However, with regard to the continuous and categorical variables, the HET models always formed the most compact and most separated clusters, even for HOM data. In real data sets, the structure of the variance–covariance matrices is unknown because the composition of the underlying groups is unknown. It is reasonable to assume, however, that in most practical situations, trait variances and trait correlations change from SP to SP. Under these circumstances, results show that the recommended strategy would be to classify the data under both models, HET (if the number of parameters allows its use) and HOM, and, based on numerical criteria such as 2, , maximum probability of classification, and number of observations classified with P < 0.75, to decide which clusters should be considered as final clusters.
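Two of the numerical criteria named above, the maximum probability of classification and the number of observations classified with P < 0.75, can be computed directly from the membership probabilities. The sketch below is in Python and rests on assumptions: the probabilities are invented, the function name is hypothetical, and summarizing the maximum probabilities by their mean is an illustrative choice rather than a documented step of the authors' procedure.

```python
def summarize_membership(probabilities, threshold=0.75):
    """Summarize a matrix of membership probabilities.

    probabilities: one row per observation, one column per group.
    Returns (assigned groups, mean of the maximum probabilities,
    number of observations whose maximum probability is below threshold).
    """
    assigned = []
    max_probs = []
    for row in probabilities:
        best = max(range(len(row)), key=lambda g: row[g])
        assigned.append(best)
        max_probs.append(row[best])
    n_uncertain = sum(1 for pmax in max_probs if pmax < threshold)
    mean_max = sum(max_probs) / len(max_probs)
    return assigned, mean_max, n_uncertain


# Hypothetical membership probabilities for four observations and three groups.
probs = [
    [0.98, 0.01, 0.01],
    [0.60, 0.35, 0.05],
    [0.10, 0.85, 0.05],
    [0.40, 0.30, 0.30],
]
assigned, mean_max, n_uncertain = summarize_membership(probs)
print(assigned)      # [0, 0, 1, 0]
print(n_uncertain)   # 2
```

Running the same summary on the HOM and HET classifications of one data set gives comparable figures for deciding which set of clusters to retain.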

Classification of 10 real data sets shows that the HET model tends to produce more compact and well-separated groups than the HOM model. However, only the HOM model was able to identify a small number of observations with very peculiar characteristics with regard to the attributes. For example, the HOM model isolated a group with three accessions with very low values for most of the variables. These three accessions belong to three maize races that are characterized as being very early with short plant type. The HET model, because of its restriction on cluster size (minimum cluster size should be equal to or larger than the number of continuous variables included in the analyses + 1) does not allow these observations to remain alone, and in this instance, merged them with seven other accessions to form a group of 10.

As expected, the HET model formed clusters with more heterogeneous variance–covariance matrices across groups than the HOM model, and the chi-square statistic for homogeneity of variance–covariance matrices is larger for clusters formed under the HET model than for those formed under the HOM model. This result shows that, if the data structure does in fact have heterogeneous variance–covariance matrices, the HET model preserves the original shape of the underlying SPs more faithfully than the HOM model (although the shape is perhaps more elongated because of the differing trait associations).
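The text does not name the chi-square statistic used to test homogeneity of the variance–covariance matrices. One widely used choice is Box's M with its chi-square approximation, and the Python sketch below assumes that test purely for illustration; it should not be read as the authors' computation. The function names and the example matrices are hypothetical.

```python
import math


def det(matrix):
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [row[:] for row in matrix]
    n = len(a)
    d = 1.0
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(a[r][c]))
        if a[piv][c] == 0.0:
            return 0.0
        if piv != c:
            a[c], a[piv] = a[piv], a[c]
            d = -d
        d *= a[c][c]
        for r in range(c + 1, n):
            f = a[r][c] / a[c][c]
            for k in range(c, n):
                a[r][k] -= f * a[c][k]
    return d


def box_m_chi2(covs, sizes):
    """Box's M test with the chi-square approximation.

    covs:  within-group sample covariance matrices (divisor n_i - 1),
           all nonsingular, so every n_i must be at least p + 1.
    sizes: group sizes n_i.
    Returns (chi-square statistic, degrees of freedom).
    """
    g = len(covs)
    p = len(covs[0])
    dfs = [n - 1 for n in sizes]
    df_pooled = sum(dfs)  # N - g
    pooled = [[sum(v * s[i][j] for v, s in zip(dfs, covs)) / df_pooled
               for j in range(p)] for i in range(p)]
    m = df_pooled * math.log(det(pooled)) - sum(
        v * math.log(det(s)) for v, s in zip(dfs, covs))
    c = ((2 * p * p + 3 * p - 1) / (6.0 * (p + 1) * (g - 1))) * (
        sum(1.0 / v for v in dfs) - 1.0 / df_pooled)
    return (1.0 - c) * m, p * (p + 1) * (g - 1) // 2


# Hypothetical 2 x 2 covariance matrices for four groups of 30 observations.
covs = [
    [[10.0, 20.0], [20.0, 400.0]],
    [[14.0, 30.0], [30.0, 500.0]],
    [[8.0, 10.0], [10.0, 300.0]],
    [[12.0, 40.0], [40.0, 600.0]],
]
chi2, df = box_m_chi2(covs, [30, 30, 30, 30])
print(round(chi2, 2), df)
```

The statistic is zero when all within-group matrices are identical and grows as they diverge, which is consistent with the larger values reported for clusters formed under the HET model.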

The results of this study show that a recommended strategy for classifying genetic resources would be to apply the Ward-MLM approach with MLM assuming both HOM and HET variance–covariance matrices. Although the HET model seems to form more compact and better separated groups than the HOM model, only the HOM has the advantage of being able to isolate small clusters with accessions with extreme values for some attributes. If few extreme values exist across most traits, the HET model will tend to merge them with other observations. The HET model, however, may suffice in most situations.

APPENDIX

For the continuous variables, the mean vectors and the variance–covariance matrices for four groups obtained by Taba et al. (1998) for two variables, days to silk (V1) and plant height (V2), were included in the SAS MVN macro to generate four SPs with heterogeneous variance–covariance matrices. For the homogeneous case, the same four mean vectors were used together with a single variance–covariance matrix in which the variance of V1 was the average variance across the four SPs, the variance of V2 was the average variance across the four SPs plus an arbitrary value of 50, and the covariance (V1, V2) was the average covariance across the four SPs. The W values were assigned within each SP, allowing the distribution shown in Table 1.
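The construction of the single matrix for the homogeneous case can be written out as a short calculation. The sketch below is in Python rather than SAS, and the four input matrices are hypothetical placeholders, since the values taken from Taba et al. (1998) are not reproduced here; the function name is also an assumption.

```python
def hom_matrix(het_matrices, v2_offset=50.0):
    """Build the common 2 x 2 matrix from the SP-specific matrices.

    Var(V1) is the average variance of V1, Var(V2) is the average
    variance of V2 plus an arbitrary offset, and Cov(V1, V2) is the
    average covariance, as described in the text.
    """
    k = len(het_matrices)
    var_v1 = sum(m[0][0] for m in het_matrices) / k
    var_v2 = sum(m[1][1] for m in het_matrices) / k + v2_offset
    cov = sum(m[0][1] for m in het_matrices) / k
    return [[var_v1, cov], [cov, var_v2]]


# Hypothetical SP-specific matrices (rows and columns ordered V1, V2).
het = [
    [[10.0, 20.0], [20.0, 400.0]],
    [[14.0, 30.0], [30.0, 500.0]],
    [[8.0, 10.0], [10.0, 300.0]],
    [[12.0, 40.0], [40.0, 600.0]],
]
print(hom_matrix(het))   # [[11.0, 25.0], [25.0, 500.0]]
```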

The SAS codes are as follows:

Heterogeneity of variance–covariance matrices:

%include 'mvn.sas';

*Store the variance-covariance matrices in four data sets;

data varcov1; input m1-m2; cards;

;

data varcov2; input m1-m2; cards;

;

data varcov3; input m1-m2; cards;

;

data varcov4; input m1-m2; cards;

;

*Store the mean vectors in four data sets;

DATA MEANS1; input m1 @@; cards;

;

DATA MEANS2; input m2 @@; cards;

;

DATA MEANS3; input m3 @@; cards;

;

DATA MEANS4; input m4 @@; cards;

;




run; quit;

DATA JF.HET;SET SAMPLE1 SAMPLE2 SAMPLE3 SAMPLE4; RUN;

Homogeneity of variance–covariance matrices:

OPTIONS LS = 132 PS = 5000;

%include 'mvn.sas';

*Store the pooled variance-covariance matrix in one data set;

data varcova; input m1-m2; cards;

;

*Store the mean vectors in four data sets;

DATA MEANS1; input m1 @@; cards;

;

DATA MEANS2; input m2 @@; cards;

;

DATA MEANS3; input m3 @@; cards;

;

DATA MEANS4; input m4 @@; cards;

;




run; quit;

DATA JF.HOM;SET SAMPLE1 SAMPLE2 SAMPLE3 SAMPLE4; RUN;
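Because the data values and the macro calls are not reproduced in the listings above, the simulation step can only be illustrated under assumptions. The Python sketch below draws bivariate normal samples for four SPs through a Cholesky factor and stacks them, mirroring the final SET statements. It is not the MVN macro: the mean vectors, matrices, sample size, seed, and function names are all hypothetical.

```python
import math
import random


def cholesky_2x2(cov):
    """Lower-triangular Cholesky factor of a 2 x 2 covariance matrix."""
    l11 = math.sqrt(cov[0][0])
    l21 = cov[0][1] / l11
    l22 = math.sqrt(cov[1][1] - l21 * l21)
    return l11, l21, l22


def sample_sp(mean, cov, n, rng):
    """Draw n bivariate normal observations (V1, V2) for one SP."""
    l11, l21, l22 = cholesky_2x2(cov)
    rows = []
    for _ in range(n):
        z1 = rng.gauss(0.0, 1.0)
        z2 = rng.gauss(0.0, 1.0)
        rows.append((mean[0] + l11 * z1, mean[1] + l21 * z1 + l22 * z2))
    return rows


def simulate(means, covs, n_per_sp, seed=2002):
    """Stack the samples of all SPs, tagging each row with its SP number."""
    rng = random.Random(seed)
    data = []
    for sp, (mean, cov) in enumerate(zip(means, covs), start=1):
        data.extend((sp, v1, v2) for v1, v2 in sample_sp(mean, cov, n_per_sp, rng))
    return data


# Hypothetical mean vectors and matrices (not the published values).
means = [(60.0, 180.0), (70.0, 210.0), (80.0, 240.0), (90.0, 270.0)]
het_covs = [
    [[10.0, 20.0], [20.0, 400.0]],
    [[14.0, 30.0], [30.0, 500.0]],
    [[8.0, 10.0], [10.0, 300.0]],
    [[12.0, 40.0], [40.0, 600.0]],
]
hom_covs = [[[11.0, 25.0], [25.0, 500.0]]] * 4  # one common matrix for all SPs

het_data = simulate(means, het_covs, n_per_sp=2000)
hom_data = simulate(means, hom_covs, n_per_sp=2000)
print(len(het_data), len(hom_data))   # 8000 8000
```

The heterogeneous and homogeneous data sets differ only in the matrices passed to the sampler; the mean vectors are shared, as in the two SAS listings.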

Received for publication September 5, 2001.


    REFERENCES
 





