Crop Science Journal of Natural Resources and Life Sciences Education
HOME HELP FEEDBACK SUBSCRIPTIONS ARCHIVE SEARCH TABLE OF CONTENTS
 QUICK SEARCH:   [advanced]


     


This Article
Right arrow Abstract Freely available
Right arrow Figures Only
Right arrow Full Text (PDF) Free
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Alert me to new issues of the journal
Right arrow Download to citation manager
Right arrow reprints & permissions
Citing Articles
Right arrow Citing Articles via HighWire
Right arrow Citing Articles via ISI Web of Science (5)
Right arrow Citing Articles via Google Scholar
Google Scholar
Right arrow Articles by Franco, J.
Right arrow Articles by Crossa, J.
Right arrow Search for Related Content
PubMed
Right arrow Articles by Franco, J.
Right arrow Articles by Crossa, J.
Agricola
Right arrow Articles by Franco, J.
Right arrow Articles by Crossa, J.
Related Collections
Right arrow Biometrics
Right arrow Plant Genetic Resources
Right arrow Statistics
Crop Science 42:1719-1726 (2002)
© 2002 Crop Science Society of America

PLANT GENETIC RESOURCES

The Modified Location Model for Classifying Genetic Resources

I. Association between Categorical and Continuous Variables

Jorge Francoa and José Crossa*,b

a Facultad de Agronomia, Univ. de la Republica Oriental del Uruguay, Garzon 780, Montevideo, Uruguay
b Biometrics and Statistics Unit, CIMMYT, Apdo. Postal 6-641, 06600 Mexico DF, México

* Corresponding author (j.crossa{at}cgiar.org)


    ABSTRACT
 TOP
 ABSTRACT
 INTRODUCTION
 MATERIALS AND METHODS
 RESULTS AND DISCUSSION
 CONCLUSIONS
 REFERENCES
 
When evaluating genetic resources, data on continuous and categorical attributes of accessions is collected. The Ward method and the modified location model (MLM) (Ward-MLM strategy) classify genetic resources using a mixture of categorical and continuous variables and combine all the categorical variables into one multinomial variable (W) that is assumed to be statistically independent from the continuous variables. The main objective of this study was to examine the robustness of the MLM for recovering underlying true subpopulations when independence between variable W and the continuous variables does not hold. In addition, several scenarios, based on different degrees of overlap of the continuous and discrete variables in simulated data sets, were generated. Results showed that when the subpopulations were well-differentiated, the Ward-MLM strategy effectively predicted the true number of subpopulations and fully recovered their structure, regardless of the level of dependence between the W variable and the vector of continuous variables. When the subpopulations showed unclear boundaries and a high degree of overlap in the W variable and in the continuous variables, the Ward-MLM strategy predicted a different number of subpopulations but fully recovered the composition of the subpopulations. In this case, new groups are formed that show a balanced and consistent structure in their composition as compared with the subpopulations. The MLM proved to be a robust model under medium and strong dependence between the variable W and the vector of continuous variables and under various kinds of overlapping between subpopulations with respect to the continuous and discrete variables.

Abbreviations: Can-r, canonical correlation • EM, expectation maximization • GM, Gaussian mixture model • LM, location model • MLM, modified location model


    INTRODUCTION
 TOP
 ABSTRACT
 INTRODUCTION
 MATERIALS AND METHODS
 RESULTS AND DISCUSSION
 CONCLUSIONS
 REFERENCES
 
THE STUDY OF PHENOTYPIC AND GENETIC diversity and the formation of core subsets are important activities in the conservation, evaluation, and utilization of genetic resources (Brown, 1989; Crossa et al., 1995). When evaluating and classifying genetic resources, researchers collect data on multiple traits to attempt to recover, as much as possible, the structure (mean of attributes) and original shape (association between attributes) of the underlying subpopulations and predict the true number of subpopulations.

Hierarchical clustering methods (Kaufman and Rousseeuw, 1990) such as Ward's minimum variance (Ward, 1963) and the Unweighted Pair Grouping with Arithmetic Means (UPGMA) can be used with categorical and continuous variables. However, statistical models such as the Gaussian mixture model (GM; Day, 1969; Wolfe, 1970; McLachlan and Basford, 1988) can be used only with continuous attributes.

The statistical location model (LM) was proposed by Lawrence and Krzanowski (1996) for classifying individuals into g homogeneous groups using a mixture of categorical and continuous variables. The LM combines the levels of all the categorical variables into one unique multinomial variable, W, with m levels; for example, combining results of two binary variables (b1 and b2) produces a variable W with m = 4, that is, m(b1, b2) with m(0, 0) = 1, m(0, 1) = 2, m(1, 0) = 3, and m(1, 1) = 4. The LM is a conditional model because it estimates the means of the continuous variables conditioned on the values of variable W.

Franco et al. (1998) proposed the MLM in the context of a two-stage clustering strategy in which initial groups are first defined using a hierarchical clustering method (Ward). The MLM is then used with those groups (Ward-MLM). One assumption of the MLM is the independence between the m levels of W and the vector of the continuous variables. That is, as opposed to the LM, the MLM estimates the mean vector of the continuous variables for each subpopulation independently of the values of W.

The Ward-MLM clustering strategy was used by Franco et al. (1998)(1999), and by Taba et al. (1998)(1999) with the number of observations ranging from 100 to 1800, and variable W with levels ranging from 10 to 50. In all of these experiments, the same set of continuous and categorical variables were used, and the correlation between W and the vector of continuous variables, when measured across all groups, ranged from 0.290 to 0.865. When the correlation was measured as the mean of the correlations by group, the range of variation was 0.379 to 0.795. These results show that the level of association between the continuous variables and variable W varies; thus, the robustness of the model is important in defining the appropriate classification strategy.

Jorgensen and Hunt (1996) and Hunt and Jorgensen (1999) used a class of multivariate mixture models that allow the inclusion of discrete and continuous variables in the classification. These models assume complete or local independence among variables and included the LM as one of their options.

Another assumption of the MLM (and LM) is that the underlying subpopulations may have homogeneous or heterogeneous variance–covariance matrices, that is, the covariances (between attributes) and variances (of the attributes) for each subpopulation can be assumed to be equal (homogeneous) or to be different (heterogeneous) (see accompanying paper Franco et al., 2002). This assumption affects the number of parameters to be estimated by the GM, LM, and MLM, and it may be a limitation when a large number of attributes are measured. In this context, Jorgensen and Hunt (1996) and Hunt and Jorgensen (1999) pointed out that the multivariate normal mixture models are highly parameterized, which leads to several difficulties during the estimation process.

The effect of using the MLM assuming different degrees of association between variable W and the continuous variables and assuming homogeneous or heterogeneous models for classifying hypothetical data sets having subpopulations with homogeneous or heterogeneous variance–covariance matrices can only be examined and tested using simulated scenarios with a known structure. In real data sets, the structure of the underlying subpopulations is unknown, as are the variances of each attribute and the associations between attributes within subpopulations. In this case, the researcher can only compare the resulting clusters under the homogeneous and heterogeneous models and under the complete independence and conditional models.

The objective of this study was to compare the results of the two-stage sequential Ward-MLM strategy for several simulated data sets with known structures of the W variable and of the continuous variables, and their associations. The study examined the robustness of the Ward-MLM strategy for recovering underlying true simulated subpopulations when independence between the multinomial variable, W, and the continuous variables does not hold. The strategy was tested under different conditions of overlap across populations on the continuous and on the discrete variables.


    MATERIALS AND METHODS
 TOP
 ABSTRACT
 INTRODUCTION
 MATERIALS AND METHODS
 RESULTS AND DISCUSSION
 CONCLUSIONS
 REFERENCES
 
Theory
The Location Model and the Modified Location Model
The problem of analyzing information composed of p-continuous and q-discrete variables simultaneously has been approached using the LM (Lawrence and Krzanowski, 1996). The idea is to reduce the set of q discrete variables to a single multinomial variable W taking values ws = 1, ..., m. Each value of ws corresponds to one combination of levels of the discrete variables; thus m is the total number of factorial combinations among levels of the discrete variables or the number of multinomial cells. The index s is used because the LM model is conditioned on the multinomial cell, s. Using this notation, each observation can be written as a 1 x (p + 1) vector x'sj = = , where for j = 1, ..., n observations, ysjk is one of the p-continuous variables, and ws is the realization of the multinomial variable W, or the multinomial cell.

According to Franco et al. (1998), the likelihood of the data matrix Xnx(p+1) for the LM is

[1]
and the likelihood for the MLM is

[2]
where the vector {Theta} contains the parameters of the model; {alpha}i (i = 1, 2, ..., g) is the proportion of observations into each subpopulation (cluster) of the mixture; pis is the proportion of observations into the sth multinomial cell of the ith subpopulation; {sum} is the common variance–covariance matrix within subpopulation; µis and µi are the vectors of means of the isth cell (LM) and the ith subpopulation (MLM), respectively.

The X matrix is considered incomplete in the sense that the true original subpopulation for each observation is unknown. Therefore, a solution is obtained using the finite mixture of distributions (Everitt and Hand, 1981) theory, and the expectation maximization (EM) algorithm (Dempster et al., 1977). The matrix is completed with vectors assigning each observation to each subpopulation and the log likelihood for the complete data matrix is obtained. Then, the Ward groups are used as the starting point for the EM algorithm and the maximum likelihood estimates of the parameters are computed. This process allows (i) finding an optimal clustering that maximizes the log-likelihood of the observations and (ii) obtaining the probability of membership for each observation into each cluster.

There are some interesting statistical and practical considerations associated with the LM and the MLM models. The conditional LM model assumes that the multinomial variable W may be associated with the multinormal vector ysj due to hypothetical association between some discrete variables and the multinormal vector. This is expressed in Eq. [1] by the presence of (m x g) vectors of means, µis, for each multinomial cell and subpopulation. On the other hand, the MLM model assumes independence between W and the multinormal variabless, so it is more restrictive than the LM model. This is expressed in Eq. [2] for the presence of only g vectors of means, µi, for each subpopulation averaged across multinomial cells. Because of these distribution assumptions, the LM model produces a likelihood function, Eq.[1], in which each observation is compared with the multinomial cell mean, µis; therefore, the maximization process will search for homogeneous groups around the cell mean and not around the subpopulation mean. On the other hand, the MLM model compares each observation with the subpopulation mean, µi, Eq. [2], (but not with the cell means); thus, the maximization process will search for homogeneous groups around the subpopulation mean.

The LM model is more parameterized than the MLM because it requires the estimation of one vector of means in each of the (m x g) cells, thus the number of means to be estimated is (m x g x p); the MLM is less parameterized because it requires only the estimation of (g x p) mean parameters. Furthermore, the LM may require the estimation of means belonging to cells that do not have sampling data (empty cells, frequently found on real data sets); this does not occur in the MLM.

Estimation of the Optimal Number of Groups
The same two rules used by Franco et al. (1998)(1999) for estimating the optimal number of clusters were utilized: the upper tail approach, or Mojena's rule (Wishart, 1987), and the likelihood profile, associated with the likelihood ratio test (Mardia et al., 1979). The likelihood profile is used as a graphical display for observing the changes to the log-likelihood function in relation to the number of groups. The optimal number of clusters occurs when the log-likelihood function shows its highest increase.

General Description of the Simulated Scenarios
The objective here was to generate simulated data sets with medium and strong levels of association between W and the continuous variables. In addition, some overlapping between subpopulations with respect to W and the continuous variables were included in order to simulate more realistic situations.

Three basic simulated multivariate normal data sets with different options were used for generating the continuous variables and the W variable. Thus, a total of eight different scenarios forming a 2 by 2 by 2 factorial arrangement were created for testing the Ward-MLM strategy (Table 1) . The three factors were (i) overlapping of the values of the continuous variables between subpopulations, high (H) and low (L); (ii) overlapping of the values of the W variable between subpopulations, no (N) and yes (Y); and (iii) strength of the dependence between W and the multi-normal vector measured by the canonical correlation (Can-r), strong (S) and medium (M).


View this table:
[in this window]
[in a new window]
 
Table 1. Eight different simulated Scenarios (1–8) obtained by varying the degree of overlap on the continuous variables (L = low, H = high), the presence or absence of overlap on the W variable (N = no, Y = yes), and the degree of association (S = Strong, M = Medium) measured by the canonical correlation (Can-r) between the vector of continuous variables and W.

 
For each scenario, five random and independent replications were simulated and five continuous variables and five subpopulations were generated (Scenario 8 has only one replicate). The number of observations per subpopulation ranged from 25 to 150; all data sets were generated under the assumption of homogeneity of variance–covariance matrices across subpopulations.

Simulation of the Continuous Variables
The first two types of basic simulated multinormal data sets were formed for five subpopulations and five continuous variables (named dimensions because they are considered random numbers, not biological variables). These data sets were generated following Milligan (1980) and Milligan et al. (1983). It is important to point out that these simulations are not for estimating parameters and their confidence intervals but rather for studying the behavior of the classification methods under different known scenarios of a factorial experimental design. For this reason, Milligan (1980) and Milligan et al. (1983) used three replicates. In this study, however, five replicates were used.

As proposed by Milligan (1980) and Milligan et al. (1983), two restrictions were considered during the process of simulation: (i) in order to satisfy the requirement of external isolation among groups, overlap on the first dimension between subpopulations was not allowed, and (ii) in order to allow the subpopulations having internal cohesion, the SAS MVN macro (SAS Institute, 2001), used as the multinormal random number generator, was truncated to accept only random normal standard numbers within the interval (-1.5, 1.5).

The first type of data set had 250 observations, five subpopulations, and 50 observations per subpopulation. This type does not allow overlap on the first dimension and low degree of overlap on the other dimensions. This type of data set was used in Scenarios 1, 2, 3, and 4. The second type of data set had 125 observations and formed five subpopulations and 25 observations per subpopulation. This type does not allow overlap on the first dimension, but allows a high degree of overlap on the remaining dimensions. This data was included in Scenarios 5, 6, and 7. The third basic data set was generated using the means and variance–covariance matrix of five continuous variables (days to anthesis, days to silk, plant height, ear height, and grain moisture) from a maize (Zea mays L.) evaluation trial from the USA (Taba et al., 1999). Each subpopulation had 150 individuals; the data set had 750 observations. This data set has well-differentiated subpopulations but shows overlapping on all the dimensions across subpopulations. This data was used in Scenario 8 with only one replicate.

An illustration of the high and low overlap that was randomly simulated for the continuous variables is shown in Fig. 1 ; high and low overlap of values for the dimensions D2 through D5 between subpopulations (S1–S5) is observed. Dimension D1 had no overlap (data not shown).



View larger version (23K):
[in this window]
[in a new window]
 
Fig. 1. Plot of the minimum (MIN), mean, and maximum (MAX) values of four simulated dimensions (D2–D5) and five subpopulations (S1–S5) of a random replicate showing high (a) and low (b) overlap. Dimension D1 (not shown) does not overlap.

 
Discriminant Analysis (SAS, 1989) using the subpopulations as classes and the five continuous variables was applied to all replications of the three data sets previously described. Results showed that none of them had any observation reclassified in the a posteriori analysis of the data. This result indicates that the subpopulations of the three data sets are well differentiated. This point is important because in order for any classification method to recover the true subpopulations, they must, in fact, exist in the data sets (i.e., subpopulations that are not well-differentiated form a structure that numerical classification methods can not separate). Further details of the data simulation process are given in the Appendix.

Simulation of the Multinomial Variable W
The next step was to generate the multinomial random variable, W, with a different degree of association with the vector of the multivariate normal variables. Different degree of overlap on the variable W between subpopulations was allowed.

For each scenario, the average degree of association between the continuous variables and W, across subpopulations and replications, was measured by the Can-r coefficient, and it ranged from medium, Can-r = 0.49, to strong, Can-r = 0.89 (Table 1). Further details of the simulation process are given in the Appendix.


    RESULTS AND DISCUSSION
 TOP
 ABSTRACT
 INTRODUCTION
 MATERIALS AND METHODS
 RESULTS AND DISCUSSION
 CONCLUSIONS
 REFERENCES
 
The results of applying the Ward-MLM strategy to the various scenarios will be described giving consideration to two criteria. First, how well the number of true subpopulations was estimated, and second, how well the structure of the subpopulations was recovered. For example, the true number of subpopulations (five in this study) may be identified by the Ward-MLM strategy, but the observations of the new clusters may be a mixture of observations from different subpopulations; that is, the structure of the true subpopulations is not well recovered. On the other hand, perhaps the estimated number of subpopulations is six (with one of the true subpopulations split into two groups), but with complete recovery of the subpopulations, in the sense that no mixture of observations from different subpopulations occurred.

Scenarios 1 (LNS) and 2 (LNM)
These two scenarios comprise the data sets with low overlap on the continuous variables (L), nonoverlap on the discrete variables (N), and strong (S) and medium (M) association between the multinomial and the continuous variables. These scenarios included the most separated subpopulations, and, as expected, the true number of five subpopulations was correctly identified (Fig. 2a and 2b show the log-likelihood profile used for the definition of the number of groups). For all 10 replicates, the Ward-MLM strategy completely recovered the true subpopulations even under the strong dependence (Can-r = 0.84). In summary, for Scenarios 1 and 2, all of the subpopulations in all replications were fully recovered, and the number of true subpopulations was correctly identified on a consistent basis.



View larger version (31K):
[in this window]
[in a new window]
 
Fig. 2. Plot of the value of the log-likelihood function vs. the number of groups for each replication in Scenarios 1 through 8. The vertical line marks the true number of subpopulations. Code description of the scenarios describes the degree of overlap on the continuous variables (L = low, H = high), the presence or absence of overlap on the W variable (N = no, Y = yes), and the degree of association between the vector of continuous variables and W (S = Strong, M = Medium).

 
Scenarios 3 (LYS) and 4 (LYM)
These two scenarios include the data sets with low overlap on the continuous variables (L), overlap on the discrete variables (Y), and strong (S) and medium (M) association between the multinomial and the continuous variables. The number of five true subpopulations was correctly estimated in four out of 10 replicates (Fig. 2c, 2d). For Scenario 3(strong dependence, Can-r = 0.71 and unclear boundaries among subpopulations, Fig. 2c), all replicates predicted six groups, but without mixing observations from different subpopulations; in all the replicates, one of the true subpopulations was split into two groups. For Scenario 4 (medium dependence, Can-r = 0.49, and extraneous values within subpopulations, Fig. 2d), the number of true subpopulations was correctly estimated in four replicates, and their compositions were completely recovered. In the remaining replicate (the lowest line of Fig. 2d), six groups were estimated, four subpopulations were fully recovered, and the other was split into two groups. There was no mixture of observations among subpopulations for this replicate: therefore, the subpopulations were fully recovered.

In summary, for all replicates of Scenarios 3 and 4, all of the subpopulations were fully recovered with none of them mixing observations between subpopulations. In four replicates the correct number of subpopulations was identified; in six replicates one more group than the true number was estimated.

Scenarios 5 (HNS) and 6 (HNM)
These two scenarios had high overlapping on the continuous variables, no overlapping on the discrete variables, and a strong and medium association between W and the continuous variables. The appropriate number of groups was correctly identified in seven out of 10 replicates (Fig. 2e, 2f), and the true subpopulations were totally recovered. For Scenario 5 (strong dependence, Can-r = 0.86, Fig. 2e), three replicates correctly estimated the number of true subpopulations and their composition; the other two replicates predicted six groups. In both cases, one subpopulation was split into two groups without mixing observations among subpopulations. For Scenario 6 (medium dependence, Can-r = 0.65, Fig. 2f), four replicates correctly identified the true number of subpopulations and their composition. The other replicate estimated six groups, splitting one subpopulation into two groups but without mixing observations among subpopulations. In summary, for Scenarios 5 and 6, all of the subpopulations in all of the replications were totally recovered, seven times using the correct number of subpopulations and three times using one more group than the initially simulated number.

Scenarios 7 (HYS) and 8 (HYM)
These two scenarios had high overlapping on the continuous variables and on the discrete variables, and a strong and medium association between W and the continuous variables. The MLM after Ward analysis applied to the HYS scenario correctly identified the number of true subpopulations (Fig. 2g) in three out of five replicates. The correct compositions of the five subpopulations were also fully recovered in all replicates.

For Scenario 8 (HYM), the Ward-MLM strategy estimated four subpopulations (Fig. 2h). The analysis of the structure of the five true subpopulations and the four estimated groups obtained by the Ward-MLM are shown in Table 2 . The average Mahalanobis distance, D2 (Mahalanobis, 1930) among groups with respect to the continuous variables, D2cont, was higher among the five true subpopulations than among the four Ward-MLM groups (84.21 vs. 40.39). However, the average distance between groups with respect to the discrete variables, Ddisc, was higher among the four Ward-MLM groups than among the five true subpopulations (1.00 vs. 0.74); a similar result was obtained for the mixture of discrete and continuous variables measured by the Matusita (1956) distance, Dmixt (Krzanowski, 1983) (1.39 vs. 1.41).


View this table:
[in this window]
[in a new window]
 
Table 2. Comparison of the structure of the five true subpopulations and the four groups obtained by the Ward-MLM (modified location model) strategy on Scenario 8 (HYM): Distances among subpopulations and groups for continuous variables , for discrete variables (Ddisc), and for the mixture of variables (Dmixt); F{Lambda} approximation to the Wilk's lambda; and the influence of the continuous variables V1, V3, and V5 on the structure of the groups as measure by the correlation coefficient between the specific variable and the first canonical variable.

 
The F{Lambda} approximation for the Wilk's test measuring the statistical differences among vectors of group means in the multivariate analysis of variance showed statistical differences (P < 0.01) among group means in both cases (Table 2). The influence of each continuous variable (V1, V3, and V5) on the differences between groups, measured as the correlation coefficient between each variable and the first canonical variable (Mardia et al 1979), showed a similar level of importance of the three continuous variables for the five true subpopulations (0.75, 0.76, and 0.73). However, when using the four groups obtained by the Ward-MLM strategy, V1 and V3 had more effect on the groups' differences (0.99 and 0.97) than V5 (0.65) (Table 2). These results indicate that for the HYM Scenario, the Ward-MLM strategy produces well-separated groups with respect to the discrete variables and the mixture of discrete and continuous variables, but less well-separated groups with respect to only the continuous variables.

The data set structure of the five true subpopulations and the four groups (G4) generated by Ward-MLM strategy for Scenario 8 (HYM) are shown in Table 3 . The Ward-MLM strategy joined Cells 1 and 3 (from true subpopulations SP1 and SP2) to form group G4 = 1, with 187 = 148 + 39 observations. These two cells correspond to the combination (Q1, Q2) = (0, 1) having mean values for V1 and V3 (48.39, 122.31, and 73.79, 261.87, respectively) lower than the mean values for Cells 2 and 4 (81.22, 338.91, and 95.56, 383.66, respectively) also belonging to subpopulations SP1 and SP2. Note that V1 and V3 were the most important variables for differentiating the four groups (Table 2). Similarly, the Ward-MLM strategy merged Cells 2, 4, 5, and 6 (belonging to true subpopulations SP1, SP2, and SP3) with the combination (Q1, Q2) = (0, 2) and (Q1, Q2) = (0, 3) to form G4 = 2 with 187 = 2 + 11 + 68 + 6 observations. These cells had mean values for V1 and V3 [(81.22, 338.91), (95.56, 386.66), (116.35, 511.50) and (124.81, 584.55), respectively] higher than the above and lower than the remaining. This pattern is repeated in the formation of the other G4 groups.


View this table:
[in this window]
[in a new window]
 
Table 3. Description of the five true subpopulations (SP) and the four groups obtained by the modified location model after Ward (G4) analysis in Scenario 8 (HYM). Number of observations (size) and means for the continuous variables (V1, V3, and V5) are also shown.

 
These results show the operational mechanism by which the Ward-MLM strategy clusters observations, making simultaneous use of the information existing in both types of variables, discrete and continuous, and therefore forming balanced and well-separated groups.


    CONCLUSIONS
 TOP
 ABSTRACT
 INTRODUCTION
 MATERIALS AND METHODS
 RESULTS AND DISCUSSION
 CONCLUSIONS
 REFERENCES
 
Results of this study show that when the data are structured into subpopulations (regardless of some overlapping on the continuous and discrete variables), the Ward-MLM effectively predicted the original true number of subpopulations (25 out of 36 cases) and fully recovered their structures (35 out of 36 cases), despite the level of dependence between the multinomial variable and the vector of continuous variables (Scenarios 1–8). Only when the subpopulations showed a high degree of overlap on the multinomial variable, W, and on all the continuous variables (Scenario 8, HYM) did the Ward-MLM strategy predict a different number of subpopulations and fail to recover 100% of the observations belonging to the subpopulations. However, in that case, the Ward-MLM formed new groups (G4) that showed a balanced and consistent structure in their composition that reflected the influence of the two types of variables involved in the classification process.

The highly overlapping simulated data generated under Scenario 8 produced subpopulations with very indistinct boundaries. This should make the recovery of the true subpopulation difficult by any classification method. Nevertheless, the Ward-MLM strategy seemed to recover meaningful groups based on all available information.

Results of this study confirm the statement by Franco et al. (1998) that "the assumption of independence between the continuous variables and the multinomial variable (W) of the MLM model are not as restrictive as might appear at first glance." The MLM proved to be a robust model under medium and strong dependence between the variable W and the vector of continuous variables, and under various kinds of overlap on the continuous and discrete variables.

APPENDIX

The process of simulating each scenario and replication was done in two steps: (i) selection of the vector of means, and variance–covariance matrices per subpopulation and generation of the random multinormal data sets using the SAS MVN macro (SAS Institute, 2001), and (ii) generation of the multinomial variable W and its association with the continuous variables.

Simulation of the Continuous Variables
The first step in the generation of the first two data sets used in Scenarios 1 and 2 was to select the expected mean vectors and the expected variance–covariance matrices per subpopulation. This was done in stage as follows: (i) generating random ranges from a Uniform(10, 40) distribution for each subpopulation, (ii) defining the standard deviation as 1/3 of each range, and (iii) computing the common standard deviation, S, as the average value of the standard deviation of individual subpopulations.

The nonoverlap subpopulation means corresponding to the first dimension were separated by a random distance contained in the interval [3.5S–4.5S]. For the other dimensions and the high overlap condition we generated, for each subpopulation and dimension, random ranges from a Uniform(10, 40) distribution. We then defined the mean of each subpopulation in each dimension as the mean point of the range, and the standard deviation as 1/3 of each range. Finally, we calculated the common standard deviation as the mean of the previously defined standard deviations for each dimension.

For the second, third, fourth, and fifth dimensions, and the low overlap condition, we separated the subpopulation means in each dimension a distance of 3S, using the mean and the common standard deviation obtained for the first dimension, as and described above.

The expected covariances were calculated as: S = diag (V)-1 x R x diag(V), where diag(V) is a diagonal matrix containing the common standard deviation, S, for each dimension, and R is a correlation matrix obtained from a maize field evaluation of accessions from Uruguay obtained from the Latin American Maize Project (Taba et al., 1999). The variables from that evaluation were days to anthesis, days to silking, plant height, ear height, and grain moisture. The correlation matrix, R, was:

[A1]

For the third basic data set (used for Scenario 8), the mean vectors and variance–covariance matrix were obtained using the overall observed means (m) and the variances (s2) from five variables (days to anthesis, days to silk, plant height, ear height, and grain moisture) from a maize evaluation trial from the USA (Taba et al., 1999). The values m - 2S, m + 2S, m + 6S, m + 10S, and m + 14S were assigned as the simulated subpopulation expected means, allowing a distance of 4Ss between means of neighbor subpopulations on each dimension.

The variance–covariance matrix, S, and the corresponding correlation matrix, R, obtained from the real data, were:

[A2]

Once the expected vectors of means and the variance–covariance matrices were defined, we used the SAS MNV macro (SAS Institute, 2001) for generating the random multinormal values for each replication, within each subpopulation.

Simulation of the Multinomial Variable W
The process of assigning the discrete values of the W variable for forming the different dependence conditions (S = strong or M = medium) of Table 1, for Scenarios 1 to 7 (and their replicates), consisted in using linear combinations (modified by generating random uniform values, but not allowing linear dependencies of 1.0) of the continuous variables to create the ws values for each observation per subpopulation.

Scenario 8 (Table 1) used the third type of continuous data set. In this scenario we used only three of the five continuous variables (V1, V3, and V5) and generated two categorical variables from the remaining two continuous variables {V2 was highly correlated with V1 (0.99) and V4 was highly correlated with V3 (0.95); see R matrix of Eq. [A2]}. The first categorical variable is a binary (0, 1) variable that takes the value of 0 when V2 had a value less than or equal to its mean and the value of 1 otherwise. The second categorical variable is a multi-state variable that takes the values 1, 2, 3, and 4 when the fourth continuous variable (V4) had values within the intervals (P0, P.25), (P.25, P.50), (P.50, P.75), or (P.75, P1.0), respectively, where Pp is the pth observed percentile. Combining these two categorical variables, the W variable, with six levels, was generated. This procedure generated a medium association (Can-r = 0.63, Table 1) between the vector of continuous variables and the W variable. Scenario 8 has high overlap on all the continuous variables and also in the values of the multinomial variable.

For simulating other situations, two kinds of overlap on the multinomial variable were allowed (Table A1) . The first produced unclear boundaries among subpopulations because the same multinomial values were assigned to observations belonging to different subpopulations but having similar values for the continuous variables. This situation simulated real data sets in which unclear boundaries among subpopulations are encountered. The second type of overlap included extraneous values for W within subpopulations. For example, in Table A1, subpopulation SP1 takes values of W = 1,2,7, and 8 with frequencies 20, 20, 5, and 5, so the two last values can be considered extraneous values for this subpopulation (they are frequently values in SP4).


View this table:
[in this window]
[in a new window]
 
Table A1. Number of observations for different W values of the multinomial variable in each of the five true subpopulations (SP1, SP2, SP3, SP4, SP5) for the simulated data sets having unclear boundaries or extraneous values.

 

    ACKNOWLEDGMENTS
 
Valuable suggestions and recommendations made by three anonymous reviewers are greatly appreciated.

Received for publication September 5, 2001.


    REFERENCES
 TOP
 ABSTRACT
 INTRODUCTION
 MATERIALS AND METHODS
 RESULTS AND DISCUSSION
 CONCLUSIONS
 REFERENCES
 




This article has been cited by other articles:


Home page
Crop Sci.Home page
J. Franco, J. Crossa, M. L. Warburton, and S. Taba
Sampling Strategies for Conserving Maize Diversity When Forming Core Subsets Using Genetic Markers
Crop Sci., February 24, 2006; 46(2): 854 - 864.
[Abstract] [Full Text] [PDF]


Home page
Crop Sci.Home page
J. Franco, J. Crossa, S. Taba, and H. Shands
A Sampling Strategy for Conserving Genetic Diversity when Forming Core Subsets
Crop Sci., May 6, 2005; 45(3): 1035 - 1044.
[Abstract] [Full Text] [PDF]


Home page
Crop Sci.Home page
J. Franco, J. Crossa, S. Taba, and S. A. Eberhart
The Modified Location Model for Classifying Genetic Resources: II. Unrestricted Variance-Covariance Matrices
Crop Sci., September 1, 2002; 42(5): 1727 - 1736.
[Abstract] [Full Text] [PDF]


This Article
Right arrow Abstract Freely available
Right arrow Figures Only
Right arrow Full Text (PDF) Free
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Alert me to new issues of the journal
Right arrow Download to citation manager
Right arrow reprints & permissions
Citing Articles
Right arrow Citing Articles via HighWire
Right arrow Citing Articles via ISI Web of Science (5)
Right arrow Citing Articles via Google Scholar
Google Scholar
Right arrow Articles by Franco, J.
Right arrow Articles by Crossa, J.
Right arrow Search for Related Content
PubMed
Right arrow Articles by Franco, J.
Right arrow Articles by Crossa, J.
Agricola
Right arrow Articles by Franco, J.
Right arrow Articles by Crossa, J.
Related Collections
Right arrow Biometrics
Right arrow Plant Genetic Resources
Right arrow Statistics


HOME HELP FEEDBACK SUBSCRIPTIONS ARCHIVE SEARCH TABLE OF CONTENTS
The SCI Journals Agronomy Journal Vadose Zone Journal
Journal of Natural Resources
and Life Sciences Education
Soil Science Society of America Journal
Journal of Plant Registrations Journal of
Environmental Quality
The Plant Genome