|
|
||||||||
a Facultad de Agronomia, Univ. de la Republica Oriental del Uruguay, Garzon 780, Montevideo, Uruguay
b Biometrics and Statistics Unit, CIMMYT, Apdo. Postal 6-641, 06600 Mexico DF, México
* Corresponding author (j.crossa{at}cgiar.org)
| ABSTRACT |
|---|
|
|
|---|
Abbreviations: Can-r, canonical correlation EM, expectation maximization GM, Gaussian mixture model LM, location model MLM, modified location model
| INTRODUCTION |
|---|
|
|
|---|
Hierarchical clustering methods (Kaufman and Rousseeuw, 1990) such as Ward's minimum variance (Ward, 1963) and the Unweighted Pair Grouping with Arithmetic Means (UPGMA) can be used with categorical and continuous variables. However, statistical models such as the Gaussian mixture model (GM; Day, 1969; Wolfe, 1970; McLachlan and Basford, 1988) can be used only with continuous attributes.
The statistical location model (LM) was proposed by Lawrence and Krzanowski (1996) for classifying individuals into g homogeneous groups using a mixture of categorical and continuous variables. The LM combines the levels of all the categorical variables into one unique multinomial variable, W, with m levels; for example, combining results of two binary variables (b1 and b2) produces a variable W with m = 4, that is, m(b1, b2) with m(0, 0) = 1, m(0, 1) = 2, m(1, 0) = 3, and m(1, 1) = 4. The LM is a conditional model because it estimates the means of the continuous variables conditioned on the values of variable W.
Franco et al. (1998) proposed the MLM in the context of a two-stage clustering strategy in which initial groups are first defined using a hierarchical clustering method (Ward). The MLM is then used with those groups (Ward-MLM). One assumption of the MLM is the independence between the m levels of W and the vector of the continuous variables. That is, as opposed to the LM, the MLM estimates the mean vector of the continuous variables for each subpopulation independently of the values of W.
The Ward-MLM clustering strategy was used by Franco et al. (1998)(1999), and by Taba et al. (1998)(1999) with the number of observations ranging from 100 to 1800, and variable W with levels ranging from 10 to 50. In all of these experiments, the same set of continuous and categorical variables were used, and the correlation between W and the vector of continuous variables, when measured across all groups, ranged from 0.290 to 0.865. When the correlation was measured as the mean of the correlations by group, the range of variation was 0.379 to 0.795. These results show that the level of association between the continuous variables and variable W varies; thus, the robustness of the model is important in defining the appropriate classification strategy.
Jorgensen and Hunt (1996) and Hunt and Jorgensen (1999) used a class of multivariate mixture models that allow the inclusion of discrete and continuous variables in the classification. These models assume complete or local independence among variables and included the LM as one of their options.
Another assumption of the MLM (and LM) is that the underlying subpopulations may have homogeneous or heterogeneous variancecovariance matrices, that is, the covariances (between attributes) and variances (of the attributes) for each subpopulation can be assumed to be equal (homogeneous) or to be different (heterogeneous) (see accompanying paper Franco et al., 2002). This assumption affects the number of parameters to be estimated by the GM, LM, and MLM, and it may be a limitation when a large number of attributes are measured. In this context, Jorgensen and Hunt (1996) and Hunt and Jorgensen (1999) pointed out that the multivariate normal mixture models are highly parameterized, which leads to several difficulties during the estimation process.
The effect of using the MLM assuming different degrees of association between variable W and the continuous variables and assuming homogeneous or heterogeneous models for classifying hypothetical data sets having subpopulations with homogeneous or heterogeneous variancecovariance matrices can only be examined and tested using simulated scenarios with a known structure. In real data sets, the structure of the underlying subpopulations is unknown, as are the variances of each attribute and the associations between attributes within subpopulations. In this case, the researcher can only compare the resulting clusters under the homogeneous and heterogeneous models and under the complete independence and conditional models.
The objective of this study was to compare the results of the two-stage sequential Ward-MLM strategy for several simulated data sets with known structures of the W variable and of the continuous variables, and their associations. The study examined the robustness of the Ward-MLM strategy for recovering underlying true simulated subpopulations when independence between the multinomial variable, W, and the continuous variables does not hold. The strategy was tested under different conditions of overlap across populations on the continuous and on the discrete variables.
| MATERIALS AND METHODS |
|---|
|
|
|---|
=
, where for j = 1, ..., n observations, ysjk is one of the p-continuous variables, and ws is the realization of the multinomial variable W, or the multinomial cell.
According to Franco et al. (1998), the likelihood of the data matrix Xnx(p+1) for the LM is
![]() | [1] |
![]() | [2] |
contains the parameters of the model;
i (i = 1, 2, ..., g) is the proportion of observations into each subpopulation (cluster) of the mixture; pis is the proportion of observations into the sth multinomial cell of the ith subpopulation;
is the common variancecovariance matrix within subpopulation; µis and µi are the vectors of means of the isth cell (LM) and the ith subpopulation (MLM), respectively. The X matrix is considered incomplete in the sense that the true original subpopulation for each observation is unknown. Therefore, a solution is obtained using the finite mixture of distributions (Everitt and Hand, 1981) theory, and the expectation maximization (EM) algorithm (Dempster et al., 1977). The matrix is completed with vectors assigning each observation to each subpopulation and the log likelihood for the complete data matrix is obtained. Then, the Ward groups are used as the starting point for the EM algorithm and the maximum likelihood estimates of the parameters are computed. This process allows (i) finding an optimal clustering that maximizes the log-likelihood of the observations and (ii) obtaining the probability of membership for each observation into each cluster.
There are some interesting statistical and practical considerations associated with the LM and the MLM models. The conditional LM model assumes that the multinomial variable W may be associated with the multinormal vector ysj due to hypothetical association between some discrete variables and the multinormal vector. This is expressed in Eq. [1] by the presence of (m x g) vectors of means, µis, for each multinomial cell and subpopulation. On the other hand, the MLM model assumes independence between W and the multinormal variabless, so it is more restrictive than the LM model. This is expressed in Eq. [2] for the presence of only g vectors of means, µi, for each subpopulation averaged across multinomial cells. Because of these distribution assumptions, the LM model produces a likelihood function, Eq.[1], in which each observation is compared with the multinomial cell mean, µis; therefore, the maximization process will search for homogeneous groups around the cell mean and not around the subpopulation mean. On the other hand, the MLM model compares each observation with the subpopulation mean, µi, Eq. [2], (but not with the cell means); thus, the maximization process will search for homogeneous groups around the subpopulation mean.
The LM model is more parameterized than the MLM because it requires the estimation of one vector of means in each of the (m x g) cells, thus the number of means to be estimated is (m x g x p); the MLM is less parameterized because it requires only the estimation of (g x p) mean parameters. Furthermore, the LM may require the estimation of means belonging to cells that do not have sampling data (empty cells, frequently found on real data sets); this does not occur in the MLM.
Estimation of the Optimal Number of Groups
The same two rules used by Franco et al. (1998)(1999) for estimating the optimal number of clusters were utilized: the upper tail approach, or Mojena's rule (Wishart, 1987), and the likelihood profile, associated with the likelihood ratio test (Mardia et al., 1979). The likelihood profile is used as a graphical display for observing the changes to the log-likelihood function in relation to the number of groups. The optimal number of clusters occurs when the log-likelihood function shows its highest increase.
General Description of the Simulated Scenarios
The objective here was to generate simulated data sets with medium and strong levels of association between W and the continuous variables. In addition, some overlapping between subpopulations with respect to W and the continuous variables were included in order to simulate more realistic situations.
Three basic simulated multivariate normal data sets with different options were used for generating the continuous variables and the W variable. Thus, a total of eight different scenarios forming a 2 by 2 by 2 factorial arrangement were created for testing the Ward-MLM strategy (Table 1) . The three factors were (i) overlapping of the values of the continuous variables between subpopulations, high (H) and low (L); (ii) overlapping of the values of the W variable between subpopulations, no (N) and yes (Y); and (iii) strength of the dependence between W and the multi-normal vector measured by the canonical correlation (Can-r), strong (S) and medium (M).
|
Simulation of the Continuous Variables
The first two types of basic simulated multinormal data sets were formed for five subpopulations and five continuous variables (named dimensions because they are considered random numbers, not biological variables). These data sets were generated following Milligan (1980) and Milligan et al. (1983). It is important to point out that these simulations are not for estimating parameters and their confidence intervals but rather for studying the behavior of the classification methods under different known scenarios of a factorial experimental design. For this reason, Milligan (1980) and Milligan et al. (1983) used three replicates. In this study, however, five replicates were used.
As proposed by Milligan (1980) and Milligan et al. (1983), two restrictions were considered during the process of simulation: (i) in order to satisfy the requirement of external isolation among groups, overlap on the first dimension between subpopulations was not allowed, and (ii) in order to allow the subpopulations having internal cohesion, the SAS MVN macro (SAS Institute, 2001), used as the multinormal random number generator, was truncated to accept only random normal standard numbers within the interval (-1.5, 1.5).
The first type of data set had 250 observations, five subpopulations, and 50 observations per subpopulation. This type does not allow overlap on the first dimension and low degree of overlap on the other dimensions. This type of data set was used in Scenarios 1, 2, 3, and 4. The second type of data set had 125 observations and formed five subpopulations and 25 observations per subpopulation. This type does not allow overlap on the first dimension, but allows a high degree of overlap on the remaining dimensions. This data was included in Scenarios 5, 6, and 7. The third basic data set was generated using the means and variancecovariance matrix of five continuous variables (days to anthesis, days to silk, plant height, ear height, and grain moisture) from a maize (Zea mays L.) evaluation trial from the USA (Taba et al., 1999). Each subpopulation had 150 individuals; the data set had 750 observations. This data set has well-differentiated subpopulations but shows overlapping on all the dimensions across subpopulations. This data was used in Scenario 8 with only one replicate.
An illustration of the high and low overlap that was randomly simulated for the continuous variables is shown in Fig. 1 ; high and low overlap of values for the dimensions D2 through D5 between subpopulations (S1S5) is observed. Dimension D1 had no overlap (data not shown).
|
Simulation of the Multinomial Variable W
The next step was to generate the multinomial random variable, W, with a different degree of association with the vector of the multivariate normal variables. Different degree of overlap on the variable W between subpopulations was allowed.
For each scenario, the average degree of association between the continuous variables and W, across subpopulations and replications, was measured by the Can-r coefficient, and it ranged from medium, Can-r = 0.49, to strong, Can-r = 0.89 (Table 1). Further details of the simulation process are given in the Appendix.
| RESULTS AND DISCUSSION |
|---|
|
|
|---|
Scenarios 1 (LNS) and 2 (LNM)
These two scenarios comprise the data sets with low overlap on the continuous variables (L), nonoverlap on the discrete variables (N), and strong (S) and medium (M) association between the multinomial and the continuous variables. These scenarios included the most separated subpopulations, and, as expected, the true number of five subpopulations was correctly identified (Fig. 2a and 2b
show the log-likelihood profile used for the definition of the number of groups). For all 10 replicates, the Ward-MLM strategy completely recovered the true subpopulations even under the strong dependence (Can-r = 0.84). In summary, for Scenarios 1 and 2, all of the subpopulations in all replications were fully recovered, and the number of true subpopulations was correctly identified on a consistent basis.
|
In summary, for all replicates of Scenarios 3 and 4, all of the subpopulations were fully recovered with none of them mixing observations between subpopulations. In four replicates the correct number of subpopulations was identified; in six replicates one more group than the true number was estimated.
Scenarios 5 (HNS) and 6 (HNM)
These two scenarios had high overlapping on the continuous variables, no overlapping on the discrete variables, and a strong and medium association between W and the continuous variables. The appropriate number of groups was correctly identified in seven out of 10 replicates (Fig. 2e, 2f), and the true subpopulations were totally recovered. For Scenario 5 (strong dependence, Can-r = 0.86, Fig. 2e), three replicates correctly estimated the number of true subpopulations and their composition; the other two replicates predicted six groups. In both cases, one subpopulation was split into two groups without mixing observations among subpopulations. For Scenario 6 (medium dependence, Can-r = 0.65, Fig. 2f), four replicates correctly identified the true number of subpopulations and their composition. The other replicate estimated six groups, splitting one subpopulation into two groups but without mixing observations among subpopulations. In summary, for Scenarios 5 and 6, all of the subpopulations in all of the replications were totally recovered, seven times using the correct number of subpopulations and three times using one more group than the initially simulated number.
Scenarios 7 (HYS) and 8 (HYM)
These two scenarios had high overlapping on the continuous variables and on the discrete variables, and a strong and medium association between W and the continuous variables. The MLM after Ward analysis applied to the HYS scenario correctly identified the number of true subpopulations (Fig. 2g) in three out of five replicates. The correct compositions of the five subpopulations were also fully recovered in all replicates.
For Scenario 8 (HYM), the Ward-MLM strategy estimated four subpopulations (Fig. 2h). The analysis of the structure of the five true subpopulations and the four estimated groups obtained by the Ward-MLM are shown in Table 2 . The average Mahalanobis distance, D2 (Mahalanobis, 1930) among groups with respect to the continuous variables, D2cont, was higher among the five true subpopulations than among the four Ward-MLM groups (84.21 vs. 40.39). However, the average distance between groups with respect to the discrete variables, Ddisc, was higher among the four Ward-MLM groups than among the five true subpopulations (1.00 vs. 0.74); a similar result was obtained for the mixture of discrete and continuous variables measured by the Matusita (1956) distance, Dmixt (Krzanowski, 1983) (1.39 vs. 1.41).
|
approximation for the Wilk's test measuring the statistical differences among vectors of group means in the multivariate analysis of variance showed statistical differences (P < 0.01) among group means in both cases (Table 2). The influence of each continuous variable (V1, V3, and V5) on the differences between groups, measured as the correlation coefficient between each variable and the first canonical variable (Mardia et al 1979), showed a similar level of importance of the three continuous variables for the five true subpopulations (0.75, 0.76, and 0.73). However, when using the four groups obtained by the Ward-MLM strategy, V1 and V3 had more effect on the groups' differences (0.99 and 0.97) than V5 (0.65) (Table 2). These results indicate that for the HYM Scenario, the Ward-MLM strategy produces well-separated groups with respect to the discrete variables and the mixture of discrete and continuous variables, but less well-separated groups with respect to only the continuous variables. The data set structure of the five true subpopulations and the four groups (G4) generated by Ward-MLM strategy for Scenario 8 (HYM) are shown in Table 3 . The Ward-MLM strategy joined Cells 1 and 3 (from true subpopulations SP1 and SP2) to form group G4 = 1, with 187 = 148 + 39 observations. These two cells correspond to the combination (Q1, Q2) = (0, 1) having mean values for V1 and V3 (48.39, 122.31, and 73.79, 261.87, respectively) lower than the mean values for Cells 2 and 4 (81.22, 338.91, and 95.56, 383.66, respectively) also belonging to subpopulations SP1 and SP2. Note that V1 and V3 were the most important variables for differentiating the four groups (Table 2). Similarly, the Ward-MLM strategy merged Cells 2, 4, 5, and 6 (belonging to true subpopulations SP1, SP2, and SP3) with the combination (Q1, Q2) = (0, 2) and (Q1, Q2) = (0, 3) to form G4 = 2 with 187 = 2 + 11 + 68 + 6 observations. These cells had mean values for V1 and V3 [(81.22, 338.91), (95.56, 386.66), (116.35, 511.50) and (124.81, 584.55), respectively] higher than the above and lower than the remaining. This pattern is repeated in the formation of the other G4 groups.
|
| CONCLUSIONS |
|---|
|
|
|---|
The highly overlapping simulated data generated under Scenario 8 produced subpopulations with very indistinct boundaries. This should make the recovery of the true subpopulation difficult by any classification method. Nevertheless, the Ward-MLM strategy seemed to recover meaningful groups based on all available information.
Results of this study confirm the statement by Franco et al. (1998) that "the assumption of independence between the continuous variables and the multinomial variable (W) of the MLM model are not as restrictive as might appear at first glance." The MLM proved to be a robust model under medium and strong dependence between the variable W and the vector of continuous variables, and under various kinds of overlap on the continuous and discrete variables.
The process of simulating each scenario and replication was done in two steps: (i) selection of the vector of means, and variancecovariance matrices per subpopulation and generation of the random multinormal data sets using the SAS MVN macro (SAS Institute, 2001), and (ii) generation of the multinomial variable W and its association with the continuous variables.
Simulation of the Continuous Variables
The first step in the generation of the first two data sets used in Scenarios 1 and 2 was to select the expected mean vectors and the expected variancecovariance matrices per subpopulation. This was done in stage as follows: (i) generating random ranges from a Uniform(10, 40) distribution for each subpopulation, (ii) defining the standard deviation as 1/3 of each range, and (iii) computing the common standard deviation, S, as the average value of the standard deviation of individual subpopulations.
The nonoverlap subpopulation means corresponding to the first dimension were separated by a random distance contained in the interval [3.5S4.5S]. For the other dimensions and the high overlap condition we generated, for each subpopulation and dimension, random ranges from a Uniform(10, 40) distribution. We then defined the mean of each subpopulation in each dimension as the mean point of the range, and the standard deviation as 1/3 of each range. Finally, we calculated the common standard deviation as the mean of the previously defined standard deviations for each dimension.
For the second, third, fourth, and fifth dimensions, and the low overlap condition, we separated the subpopulation means in each dimension a distance of 3S, using the mean and the common standard deviation obtained for the first dimension, as and described above.
The expected covariances were calculated as: S = diag (V)-1 x R x diag(V), where diag(V) is a diagonal matrix containing the common standard deviation, S, for each dimension, and R is a correlation matrix obtained from a maize field evaluation of accessions from Uruguay obtained from the Latin American Maize Project (Taba et al., 1999). The variables from that evaluation were days to anthesis, days to silking, plant height, ear height, and grain moisture. The correlation matrix, R, was:
![]() | [A1] |
For the third basic data set (used for Scenario 8), the mean vectors and variancecovariance matrix were obtained using the overall observed means (m) and the variances (s2) from five variables (days to anthesis, days to silk, plant height, ear height, and grain moisture) from a maize evaluation trial from the USA (Taba et al., 1999). The values m - 2S, m + 2S, m + 6S, m + 10S, and m + 14S were assigned as the simulated subpopulation expected means, allowing a distance of 4Ss between means of neighbor subpopulations on each dimension.
The variancecovariance matrix, S, and the corresponding correlation matrix, R, obtained from the real data, were:
![]() | [A2] |
Once the expected vectors of means and the variancecovariance matrices were defined, we used the SAS MNV macro (SAS Institute, 2001) for generating the random multinormal values for each replication, within each subpopulation.
Simulation of the Multinomial Variable W
The process of assigning the discrete values of the W variable for forming the different dependence conditions (S = strong or M = medium) of Table 1, for Scenarios 1 to 7 (and their replicates), consisted in using linear combinations (modified by generating random uniform values, but not allowing linear dependencies of 1.0) of the continuous variables to create the ws values for each observation per subpopulation.
Scenario 8 (Table 1) used the third type of continuous data set. In this scenario we used only three of the five continuous variables (V1, V3, and V5) and generated two categorical variables from the remaining two continuous variables {V2 was highly correlated with V1 (0.99) and V4 was highly correlated with V3 (0.95); see R matrix of Eq. [A2]}. The first categorical variable is a binary (0, 1) variable that takes the value of 0 when V2 had a value less than or equal to its mean and the value of 1 otherwise. The second categorical variable is a multi-state variable that takes the values 1, 2, 3, and 4 when the fourth continuous variable (V4) had values within the intervals (P0, P.25), (P.25, P.50), (P.50, P.75), or (P.75, P1.0), respectively, where Pp is the pth observed percentile. Combining these two categorical variables, the W variable, with six levels, was generated. This procedure generated a medium association (Can-r = 0.63, Table 1) between the vector of continuous variables and the W variable. Scenario 8 has high overlap on all the continuous variables and also in the values of the multinomial variable.
For simulating other situations, two kinds of overlap on the multinomial variable were allowed (Table A1) . The first produced unclear boundaries among subpopulations because the same multinomial values were assigned to observations belonging to different subpopulations but having similar values for the continuous variables. This situation simulated real data sets in which unclear boundaries among subpopulations are encountered. The second type of overlap included extraneous values for W within subpopulations. For example, in Table A1, subpopulation SP1 takes values of W = 1,2,7, and 8 with frequencies 20, 20, 5, and 5, so the two last values can be considered extraneous values for this subpopulation (they are frequently values in SP4).
|
| ACKNOWLEDGMENTS |
|---|
Received for publication September 5, 2001.
| REFERENCES |
|---|
|
|
|---|
This article has been cited by other articles:
![]() |
J. Franco, J. Crossa, M. L. Warburton, and S. Taba Sampling Strategies for Conserving Maize Diversity When Forming Core Subsets Using Genetic Markers Crop Sci., February 24, 2006; 46(2): 854 - 864. [Abstract] [Full Text] [PDF] |
||||
![]() |
J. Franco, J. Crossa, S. Taba, and H. Shands A Sampling Strategy for Conserving Genetic Diversity when Forming Core Subsets Crop Sci., May 6, 2005; 45(3): 1035 - 1044. [Abstract] [Full Text] [PDF] |
||||
![]() |
J. Franco, J. Crossa, S. Taba, and S. A. Eberhart The Modified Location Model for Classifying Genetic Resources: II. Unrestricted Variance-Covariance Matrices Crop Sci., September 1, 2002; 42(5): 1727 - 1736. [Abstract] [Full Text] [PDF] |
||||
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| HOME | HELP | FEEDBACK | SUBSCRIPTIONS | ARCHIVE | SEARCH | TABLE OF CONTENTS |
| The SCI Journals | Agronomy Journal | Vadose Zone Journal | |||
| Journal of Natural Resources and Life Sciences Education |
Soil Science Society of America Journal | ||||
| Journal of Plant Registrations | Journal of Environmental Quality |
The Plant Genome | |||