Crop Science
Published online 1 March 2007
Published in Crop Sci 47:887-890 (2007)
© 2007 Crop Science Society of America
677 S. Segoe Rd., Madison, WI 53711 USA

PLANT GENETIC RESOURCES

Evaluating the Reliability of Structure Outputs in Case of Relatedness between Individuals

Létizia Camus-Kulandaivelu (a), Jean-Baptiste Veyrieras (a), Brigitte Gouesnard (b), Alain Charcosset (a), and Domenica Manicacci (a,*)

(a) UMR 8120 Génétique Végétale, INRA UPS INA-PG CNRS, Ferme du Moulon, 91190 Gif sur Yvette, France
(b) UMR 1097 Diversité et Génomes des Plantes Cultivées, INRA Domaine de Melgueil, 34130 Mauguio, France

* Corresponding author (manicacci{at}moulon.inra.fr).


    ABSTRACT
Inference of population structure from neutral marker loci is a key issue for association genetics. However, the presence of highly related individuals, commonly observed in breeders' panels, may lead to deviation from model hypotheses and therefore unreliable group assignment. The present note proposes a tool to help interpret Structure software outputs on populations that include highly related individuals, such as plant breeding material. We ran Structure software on simple sequence repeat (SSR) data from two maize (Zea mays L. ssp. mays) inbred panels. We propose a criterion to evaluate Structure stability based on Euclidian distance between outputs. This approach shows a high stability across runs for the panel composed of first cycle inbred lines. On the contrary, the presence of highly related individuals in the second panel induces strong instability in Structure outputs.

Abbreviations: SSR, simple sequence repeat


    INTRODUCTION
Inferring population structure from genetic markers is a key stage in fields such as population genetics and association mapping. Indeed, population subdivision or admixture may generate linkage disequilibrium between distant chromosome regions. In association studies, this may lead to spurious associations between traits and single nucleotide polymorphisms, unless population structure is properly accounted for in statistical analyses. In several association or linkage disequilibrium studies in plants (Thornsberry et al., 2001; Liu et al., 2003; Wilson et al., 2004), population genetic structure has been assessed using the free software Structure (Pritchard et al., 2000; Falush et al., 2003). Structure relies on a model-based clustering method that uses a Bayesian framework to assign individuals to several genetic groups so as to minimize within-group linkage disequilibrium and deviation from Hardy–Weinberg equilibrium. However, the Structure algorithm may fail to converge in some cases. There are three reasons for poor convergence: (i) poor exploration of the parameter space, (ii) label switching along iterations, leading to equiprobable classification of individuals in all groups, and (iii) complex genetic structure (Pritchard et al., 2000; Falush et al., 2003). Problems (i) and (ii) can be addressed by running the software several times with the same set of parameters. Problem (iii) has been addressed in many recent studies, either theoretical or in the field of animal and human genetics, which questioned the reliability of Structure results in cases of complex genetic structure. They showed that Structure may be sensitive to sample size, the type (amplified fragment-length polymorphism vs. simple sequence repeat [SSR], autosomal vs. sex-linked) and number of loci, and run options such as "correlation among groups for allelic frequencies" (Bamshad et al., 2003; Rosenberg et al., 2003, 2005; Ramachandran et al., 2004; Evanno et al., 2005).
Some of these studies checked the ability of Structure to reassign individuals to their known genetic group of origin (Rosenberg et al., 2001, 2002) and to minimize admixture based on a clusteredness criterion (Rosenberg et al., 2005). However, contrary to human and animal structure, the genetic structure of plant species is generally poorly known a priori. It is thus seldom possible to evaluate Structure results based on assignment criteria. Furthermore, plant panels often gather related and/or admixed accessions that are of high interest for breeders but may artificially increase the number of groups (Falush et al., 2003). Yu et al. (2006) considered the relatedness within plant panels and proposed to account for the multiple levels of relatedness in association tests by using both Structure output and a kinship criterion. Yet, there is still a need for a way to evaluate Structure results that does not rely on either individual assignment or clusteredness. In this note, we propose an ad hoc criterion to evaluate Structure results based on the across-run stability of individual predicted allelic frequencies. Contrary to the individual membership that was used by Rosenberg et al. (2002) to define a similarity criterion, individual predicted allelic frequencies allow the comparison of runs with different group numbers and do not require a priori knowledge of individual group membership. We used this criterion to analyze Structure outputs obtained for two contrasting maize inbred line samples and show that output stability is very different for these two panels.


    MATERIALS AND METHODS
Plant Material
We used two maize inbred panels previously described by Camus-Kulandaivelu et al. (2006). The first panel, hereafter "first cycle inbred panel," consisted of 153 first cycle inbred lines obtained by selfing from ancestral landrace varieties. It thus minimizes relatedness among individuals. The second panel, hereafter "whole inbred panel," consisted of the 153 inbred lines from the first cycle inbred panel and 222 additional inbred lines from either more advanced selection cycles or synthetic populations. These two panels were defined to represent European and American diversity. They thus include material from the main gene pools from these origins: tropical inbred lines, European inbred lines, Corn Belt dent inbred lines, and northern flint inbred lines. These two maize panels were genotyped for 55 SSR markers with motifs of three or more base pairs (Camus-Kulandaivelu et al., 2006). This number of loci, although moderate, has been reported to be sufficient to obtain robust clustering patterns with Structure for various sample sizes (see Rosenberg et al. [2005] for a review).

Structure Options
Structure software version 2.0 was run on both inbred panels. Rosenberg et al. (2005) showed that the assumption about correlation in allele frequencies across populations had a great impact on Structure results of clusteredness. Since the single domestication of maize is established (Matsuoka et al., 2002), we restricted our analysis to the correlated frequency model that assumes that all clusters originated from a common ancestral population through drift and/or selection.

The number of groups, K, is set initially for each Structure run, and the optimal K value is determined based on the estimated log likelihood of the data (hereafter goodness-of-fit) following Pritchard et al. (2000). We performed 10 independent runs of Structure for each value of K, with K varying from 2 to 10, for both the whole inbred panel and the first cycle inbred panel. Individual inbred lines were considered as haploid genotypes, one allele being taken at random in case of residual heterozygosity (less than 1%). We set other parameters to their default values, using the admixture model and the infer alpha option. As indicated by Liu et al. (2003), we chose burn-in and sampling periods of 5 × 10^5 iterations. Details of input options and output goodness-of-fit values are presented in Fig. 1.
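
The haploid coding described above can be illustrated with a minimal sketch. It assumes, purely for illustration, that each inbred line is stored as a list of two-allele tuples (one tuple per SSR locus, alleles given as fragment sizes); the function name `haploidize` and this data layout are hypothetical and are not taken from the authors' programs.

```python
import random


def haploidize(genotype, rng):
    """Return one allele per locus for an inbred line.

    genotype: list of (allele_1, allele_2) tuples, one per locus.
    For a homozygous locus both alleles are equal, so the choice is
    irrelevant; for a residually heterozygous locus one of the two
    alleles is drawn at random.
    """
    return [rng.choice(pair) for pair in genotype]


# Invented toy line: loci 1 and 3 are homozygous, locus 2 is heterozygous.
toy_line = [(152, 152), (98, 104), (210, 210)]
toy_rng = random.Random(0)  # fixed seed so that the draw is repeatable
toy_haploid = haploidize(toy_line, toy_rng)
```

Passing an explicit random generator keeps the draw reproducible, which matters if the same haploid data set must be reused across all Structure runs.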


Figure 1. Goodness-of-fit (following Pritchard et al., 2000) as a function of group number K in (a) the first cycle inbred panel and (b) the whole inbred panel. Average estimated goodness-of-fit over 10 independent outputs are indicated with a solid line. Structure options are "Admixture model," "allele frequency correlated," "infer alpha," and "separate alpha for each population." Inbred lines were considered as haploid organisms to relax the Hardy–Weinberg condition on group definition.

 
Structure Stability
We calculated individual predicted allelic frequencies for each of the 90 Structure outputs obtained for the whole inbred panel and the first cycle inbred panel as

$$p_{jm}(a) = \sum_{k=1}^{K} q_{jkm}\, f_{km}(a) \qquad [1]$$
where pjm(a) is the probability of inbred line j to have allele a in output m, fkm(a) is allele a frequency in cluster k in output m, qjkm is the proportion of inbred line j genome originating from group k in output m, and K is the number of groups in output m. For each panel, we additionally calculated the matrix of predicted allelic frequencies under hypothesis of no genetic structure (K = 1). In this case, all inbred lines have equal predicted frequency for a given allele (i.e., the average frequency of this allele over the panel).
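
As an illustration of this calculation, the sketch below computes predicted allelic frequencies as the membership-weighted sum of cluster allele frequencies, which is how the definitions above read. The function names, the nested-list layout (`q[j][k]`, `f[k][a]`), and the toy numbers are assumptions made for this example; the authors' own implementation is the C program proba2.c, which is not reproduced here.

```python
def predicted_allele_freqs(q, f):
    """Predicted allelic frequencies for one Structure output.

    q[j][k]: proportion of the genome of line j originating from group k.
    f[k][a]: frequency of allele a in group k.
    Returns p with p[j][a] = sum over k of q[j][k] * f[k][a].
    """
    n_groups = len(f)
    n_alleles = len(f[0])
    p = []
    for q_j in q:
        if len(q_j) != n_groups:
            raise ValueError("membership vector does not match group count")
        p.append([sum(q_j[k] * f[k][a] for k in range(n_groups))
                  for a in range(n_alleles)])
    return p


def no_structure_freqs(panel_mean_freqs, n_individuals):
    """K = 1 model: every line gets the panel-wide average frequencies."""
    return [list(panel_mean_freqs) for _ in range(n_individuals)]


# Invented toy example: two lines, two groups, two alleles.
toy_q = [[1.0, 0.0], [0.5, 0.5]]
toy_f = [[0.8, 0.2], [0.2, 0.8]]
toy_p = predicted_allele_freqs(toy_q, toy_f)
toy_baseline = no_structure_freqs([0.5, 0.5], 2)
```

In this toy case the first line belongs entirely to group 1 and inherits its frequencies, whereas the fully admixed second line gets the average of the two groups. Because the result is indexed by individual and allele only, outputs with different K can be compared directly.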

For each panel, we calculated the average pairwise Euclidian distances between matrices of predicted probabilities, including the no genetic structure model, analogous to the square root of the mean square of differences between predicted allelic frequencies of models m and m', as

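$$D_{mm'} = \sqrt{\frac{1}{\mathrm{NBALL} \times \mathrm{NBIND}} \sum_{j=1}^{\mathrm{NBIND}} \sum_{a=1}^{\mathrm{NBALL}} \left[ p_{jm}(a) - p_{jm'}(a) \right]^{2}} \qquad [2]$$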
where NBALL is the total number of alleles; NBIND is the total number of individuals; pjm(a) and pjm'(a) are the probabilities for individual j to have allele a according to outputs m and m', respectively. For identical K value in outputs m and m', Dmm' can be seen as a measure of Structure uncertainty in the estimation of individual memberships and allelic frequencies in groups. Dmm' is a half (triangular) square matrix whose size is the number of Structure outputs compared (i.e., 91 in our study). The individual predicted allelic frequencies and the Dmm' matrices were generated using C programs, respectively proba2.c and distance.output.c, that are available on request from Létizia Camus-Kulandaivelu (camus{at}moulon.inra.fr).
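
The distance can be sketched as follows, assuming it is the square root of the mean squared difference over all individuals and alleles, as the verbal description above suggests. The function names and toy matrices are hypothetical; this is not the authors' distance.output.c.

```python
import math


def output_distance(p_m, p_m2):
    """Root mean square difference between two matrices of predicted
    allelic frequencies, both indexed as p[individual][allele]."""
    n_ind = len(p_m)
    n_all = len(p_m[0])
    total = 0.0
    for row_m, row_m2 in zip(p_m, p_m2):
        for x, y in zip(row_m, row_m2):
            total += (x - y) ** 2
    return math.sqrt(total / (n_ind * n_all))


def distance_matrix(outputs):
    """Symmetric matrix of pairwise distances among a list of outputs."""
    n = len(outputs)
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i):
            d[i][j] = d[j][i] = output_distance(outputs[i], outputs[j])
    return d


# Invented toy matrices: two individuals, two alleles.
toy_a = [[0.8, 0.2], [0.5, 0.5]]
toy_b = [[0.6, 0.4], [0.5, 0.5]]
toy_d = distance_matrix([toy_a, toy_b])
```

Here only the first individual differs between the two toy outputs (by 0.2 on each allele), so the distance is the square root of 0.08/4. The full symmetric matrix is stored for convenience, although only one triangle is needed.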

Dmm' matrices were then used to build a neighbor joining tree using PHYLIP software (Felsenstein, 1989). To assess the stability of Structure outputs, we calculated the average Dmm' among outputs for each group number K.
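
A minimal sketch of this per-K stability summary is given below. It assumes that the average is taken over all unordered pairs of outputs sharing the same K (45 pairs for 10 replicates); the function name and the toy distance matrix are invented for illustration.

```python
from collections import defaultdict
from itertools import combinations


def mean_distance_by_k(k_labels, d):
    """Average pairwise distance among outputs that share the same K.

    k_labels[i]: group number K used for output i.
    d[i][j]: distance between outputs i and j.
    Returns {K: mean distance over unordered pairs of outputs with that K}.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for i, j in combinations(range(len(k_labels)), 2):
        if k_labels[i] == k_labels[j]:
            sums[k_labels[i]] += d[i][j]
            counts[k_labels[i]] += 1
    return {k: sums[k] / counts[k] for k in sums}


# Invented toy case: three replicates at K = 2 and two replicates at K = 3.
toy_labels = [2, 2, 2, 3, 3]
toy_dist = [
    [0.0, 0.1, 0.2, 0.5, 0.5],
    [0.1, 0.0, 0.3, 0.5, 0.5],
    [0.2, 0.3, 0.0, 0.5, 0.5],
    [0.5, 0.5, 0.5, 0.0, 0.4],
    [0.5, 0.5, 0.5, 0.4, 0.0],
]
toy_stability = mean_distance_by_k(toy_labels, toy_dist)
```

Low values indicate that replicate runs at a given K predict nearly the same allelic frequencies for every individual; distances between outputs with different K are ignored in this summary.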

Measure of Relatedness among Individuals
We compared the distribution of kinship coefficient (Ritland, 1996) measured with SSR markers within (i) the first cycle inbred line panel and (ii) the subset of 222 additional inbred lines. Ritland's kinship coefficient is centered to 0, negative for individuals that are less related than average and conversely positive for individuals with higher relatedness than average, and may reach positive values as high as 4 for bi-allelic loci (see Table 1, Ritland, 1996). Calculations were performed with SPAGeDi software (Hardy and Vekemans, 2002) using the whole inbred line panel as reference population for allelic frequencies.


Table 1. Average distance among runs (average Dmm') over 10 Structure replicates for each group number K in the two maize inbred panels.

 

    RESULTS AND DISCUSSION
The optimal number of groups that describes a given panel is determined by comparing goodness-of-fit values provided by Structure for each run under an a priori number of groups K. According to Structure documentation (http://pritch.bsd.uchicago.edu/software/readme_2_1/readme.html; verified 19 Feb. 2007), goodness-of-fit is expected to show a quick increase until reaching an optimal group number and then present a "plateau phase" characterized by constant value or very low increase. For the first cycle inbred panel, a clear plateau phase was reached for K = 5 (Fig. 1a). On the contrary for the whole inbred panel, the average goodness-of-fit showed constant increase, although maximum goodness-of-fit exhibited a reduced increase from five to six groups and from seven to eight groups (Fig. 1b).
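
The plateau reading described above is made by eye on Fig. 1, but it can be approximated numerically. The sketch below is one possible heuristic, not the authors' procedure: it averages goodness-of-fit over replicate runs and reports the first K beyond which the gain from adding a group drops below a chosen fraction of the largest gain. The threshold `rel_tol`, the function names, and the toy values are all assumptions.

```python
def mean_fit_by_k(fits):
    """fits: {K: [goodness-of-fit of each replicate run]} -> {K: mean}."""
    return {k: sum(v) / len(v) for k, v in sorted(fits.items())}


def plateau_k(mean_fit, rel_tol=0.1):
    """Smallest K after which adding one group improves the mean
    goodness-of-fit by less than rel_tol times the largest observed gain.
    Returns None when no such K exists (no plateau detected)."""
    ks = sorted(mean_fit)
    gains = [mean_fit[b] - mean_fit[a] for a, b in zip(ks, ks[1:])]
    largest = max(gains)
    if largest <= 0:
        return ks[0]
    for k, gain in zip(ks, gains):
        if gain < rel_tol * largest:
            return k
    return None


# Invented log-likelihood values, for illustration only.
toy_fits = {
    2: [-1000.0, -1010.0],
    3: [-900.0, -904.0],
    4: [-850.0, -848.0],
    5: [-846.0, -848.0],
    6: [-845.0, -847.0],
}
toy_k = plateau_k(mean_fit_by_k(toy_fits))
steady_rise = plateau_k({2: -1000.0, 3: -900.0, 4: -800.0, 5: -700.0})
```

With the toy values the gains are 103, 53, 2, and 1, so the plateau starts at K = 4. A steadily increasing curve yields no plateau and the function returns `None`, which is the kind of pattern the text reports for the whole inbred panel.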

To evaluate Structure stability, Dmm' matrices were used to build neighbor joining trees of the 91 Structure outputs (10 replicates for K varying from 2 to 10, as well as the no genetic structure model) for each panel (Fig. 2 and 3). For first cycle inbreds, Fig. 2 shows a clear clustering of outputs with identical K, except for 9 and 10 group outputs that are mixed together. Conversely, the neighbor joining tree based on the whole inbred panel outputs (Fig. 3) shows a complex pattern: only the three group outputs cluster together, the two group outputs are divided into two clusters, and no clear structure was found for outputs with more than three groups. Also, for first cycle inbreds, the average Dmm' shows the lowest values for two and three groups (i.e., 0.0009 and 0.0010, respectively), and then increases to values varying around 0.0060 for four and five groups and fluctuating between 0.0138 and 0.0188 for 6 to 10 groups (Table 1). For the whole inbred panel, this value is as low as 0.0006 for three groups and equal to or higher than 0.0268 for other group numbers.


Figure 2. Neighbor joining tree for 91 Structure outputs, including the no genetic structure model, obtained with the maize first cycle inbred panel. For each output, group number K is indicated at leaf positions. *, best goodness-of-fit for each group number.

 

Figure 3. Neighbor joining tree for 91 Structure outputs, including the no genetic structure model, obtained with the maize whole inbred panel. For each output, group number K is indicated at leaf positions. *, best goodness-of-fit for each group number.

 
Our results show that the Structure algorithm provides a clear-cut view of population structure and individual assignment into genetic groups for the first cycle inbred panel. Structure outputs are reasonably stable for each group number, with an optimal number of five groups easily determined from goodness-of-fit and stability values. Indeed, for six groups or more, goodness-of-fit reaches a plateau and stability among outputs decreases. Although the whole inbred panel fully includes the first cycle panel, such population structure does not clearly appear from Structure outputs run under identical options. First, goodness-of-fit increases regularly with group number without showing any plateau phase and second, Structure stability is poor (except for three groups). For Structure runs on the whole inbred panel with K higher than 5, additional groups are made of highly related "families" of inbred lines (Camus-Kulandaivelu et al., 2006). Recent studies focusing on sample effects on Structure outputs show that increasing sample size improves clusteredness when other parameters, such as the number of genetic markers, are equal (Rosenberg et al., 2005). As we observed the reverse pattern in the present study, the discrepancy between the two panels is more likely explained by qualitative differences between samples, such as the addition in the whole inbred panel of many elite inbreds with high and complex relatedness.

To assess the difference in relatedness among panels, we compared the distribution of kinship coefficient (Ritland, 1996) measured with SSR markers within (i) the first cycle inbred line panel and (ii) the subset of 222 additional inbred lines. The 222 additional inbred lines show significantly higher variance (0.005338) than the first cycle inbred panel (0.004698), mainly due to very closely related pairs of individuals among the additional lines (the highest kinship value is 2.641 for the 222 additional inbred lines and 1.030 for the first cycle inbred lines). Indeed, the 222 additional lines include modern inbreds originating from a reduced number of progenitors following various breeding methods that include backcrossing. The strong relatedness among some inbred lines leads to different grouping possibilities and thus low Structure output stability. Indeed, Structure proceeds by extracting first the more distant populations and then, when increasing group number, by identifying less genetically distant groups. Families of highly related individuals introduce sufficient correlations among some individuals for them to be identified as genetic groups. Conversely, the first cycle inbred panel in our study is made of lines directly originating from landraces, for which population history makes the panmixia and low linkage disequilibrium hypothesized by Structure more reasonable. This case study indicates that relatedness among individuals strongly affects Structure outputs. High relatedness is very common in species with intensive breeding history. In those cases, care is recommended in the use of Structure, and outputs should be interpreted by combining both the examination of stability statistics, as proposed in the present note, and empirical knowledge of the plant material. The examination of Structure group composition in the light of the breeder's empirical knowledge should help to determine Structure output consistency, as described by Camus-Kulandaivelu et al. (2006). When strong instability is observed, we advise first running Structure on a subsample representing the genetic diversity of the whole panel, removing families of related accessions. Such a subsample can be defined using software such as MSTRAT (Gouesnard et al., 2001) that maximizes genetic diversity by identifying core collections of the desired size. Once the genetic structure of the subsample has been assessed, admixture proportions of the additional individuals could be calculated, assuming that the population allelic frequencies are equal to those previously estimated. This option is not implemented in the current Structure version, and we are developing specific software to do it.

Besides increasing the knowledge of population structure, Structure software is often used as a preliminary stage before association genetics. In this context, plant breeders often wish to include elite, highly related material in their panels. Statistical significance of some associations may be artificially increased by the phenotypic and genetic specificity of such families, leading to spurious associations with any family-specific allele anywhere in the genome. Our approach may help detect such situations. Association genetics requires methods that take both population structure and strong relatedness within some families into account, such as the unified mixed model (Yu et al., 2006) based on both Structure outputs and pedigree estimation.


    ACKNOWLEDGMENTS
 
We are grateful to M. Dupin, J. Laborde, and colleagues at Saint Martin de Hinx for managing the inbred line collection analyzed here. Simple sequence repeat analyses were conducted at INRA le Moulon by D. Madur, V. Combes, and F. Dumas and were funded by INRA and Genoplante. L. Camus-Kulandaivelu is funded by a grant from INRA and the Languedoc-Roussillon region.


    NOTES
All rights reserved. No part of this periodical may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, recording, or any information storage and retrieval system, without permission in writing from the publisher. Permission for printing and for reprinting the material contained herein has been obtained by the publisher.

Received for publication June 7, 2006.

