a Embrapa Rice and Beans, Santo Antônio de Goiás, GO, Brazil, 75375
b Dep. of Plant Breeding, Cornell University, 252 Emerson Hall, Ithaca, NY, 14853-1902
* Corresponding author (mes12{at}cornell.edu)
| ABSTRACT |
|---|
Abbreviations: AA, association analysis; LD, linkage disequilibrium; QTL, quantitative trait locus/loci.
| INTRODUCTION |
|---|
While significant LD in random mating populations is evidence of tight linkage, population perturbations like migration, inbreeding, and selection can build up LD among loosely linked or even unlinked loci. Therefore, the characteristics of the population under study must be recognized when conducting AA and interpreting its results. Scientific plant breeding is a recent activity that normally involves a narrow genetic pool, such that breeding populations can be traced back to relatively few original parents, normally landraces, within a relatively small number of generations (e.g., Bered et al., 2002; Lu et al., 2005). Under this scenario, mutations play a minor role and most of the observed LD is expected to reflect the haplotypes of the original parents. Moreover, because there were few opportunities for recombination between the time of introduction of a parent and the present, LD in some plant breeding populations may not reliably indicate tight linkage. Between unlinked loci, LD can be caused by simultaneous selection of combinations of alleles at different genes, including epistasis, and by population structure (Hartl and Clark, 1997). Both phenomena should be common in plant breeding populations. Selection should affect LD in parts of the genome related to traits that are relevant for the breeding program. This source of distortion should be taken into consideration in the interpretation of results of AA in a case-specific manner. In contrast, population structure is expected to affect the pattern of LD over the whole genome and must be controlled a priori for correct association analysis (Pritchard et al., 2000b).
Most of the literature on AA refers to human populations or theoretical panmictic populations. There is limited information and discussion about applications of this technique to plant breeding. As the information generated by quantitative trait loci (QTL) studies accumulates, a method is needed to efficiently convert that information into practical tools for plant selection. Association analysis can be an effective approach for closing the gap between QTL analysis and marker-assisted selection.
The objective of this paper is to raise awareness among plant breeders of practical and theoretical aspects related to the application of AA in plant breeding programs. We compare three types of plant populations (germplasm bank collections, synthetic populations, and elite lines) with respect to their potential and limitations as experimental materials for AA and propose that populational breeding represents a favorable setting for AA. We describe a model to make predictions about marker-gene associations in synthetic populations, which can be useful for evaluating the potential of a given population for AA, and forecasting the response to marker-assisted selection.
| Marker Allele Means Are Biased Estimators of Gene Allele Effects |
|---|
Significant association between marker and trait depends on differences between phenotypic means of lines carrying different marker alleles as an indication of the effects of a gene in LD with the marker. However, estimation of gene effects through molecular markers is susceptible to errors caused by sampling variance and systematic biases. Sampling errors cause overestimation of gene effects when the same data set is used for detection and estimation of effects, because the errors add to the differences between alleles. QTL × environment interaction is another example of sampling error, related to the finite number of locations and years tested. For this reason, before undertaking the effort of transferring a QTL allele, cross validation is advisable to confirm its consistency and to estimate unbiased expectations of genetic gain. Cross validation can be achieved on the basis of independent data sets (Melchinger et al., 1998), preferably in multiple environments, or by resampling from a larger data set (Schön et al., 2004). Previous studies have demonstrated that the cross-validated allele effect was only about half the amount initially detected and that frequently the presence of the QTL was not repeatable (Melchinger et al., 1998; Schön et al., 2004).
Sampling errors are a consequence of limited sample size. However, as demonstrated in the next section, marker allele means could be biased estimators of gene effect even in infinite samples.
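The overestimation that arises when one data set is used for both detection and estimation can be illustrated with a small simulation. This is only a sketch under assumed values: the true effect, group size, known standard deviation, z threshold, and the function name `simulate_detection_bias` are all hypothetical choices for illustration, not values from the studies cited above.

```python
import random
import statistics


def simulate_detection_bias(true_effect=0.3, n_per_group=20, sd=1.0,
                            n_experiments=2000, z_crit=1.96, seed=1):
    """Compare the mean estimated effect over all experiments with the
    mean over only those experiments that declared the marker significant.

    Assumption: the phenotypic standard deviation is known, so a simple
    one-sided z test is used for detection.
    """
    rng = random.Random(seed)
    se = sd * (2.0 / n_per_group) ** 0.5  # standard error of the mean difference
    all_estimates = []
    detected = []
    for _ in range(n_experiments):
        mean_M = statistics.fmean(
            rng.gauss(true_effect, sd) for _ in range(n_per_group))
        mean_m = statistics.fmean(
            rng.gauss(0.0, sd) for _ in range(n_per_group))
        estimate = mean_M - mean_m
        all_estimates.append(estimate)
        if estimate / se > z_crit:  # marker declared significant
            detected.append(estimate)
    mean_detected = statistics.fmean(detected) if detected else None
    return statistics.fmean(all_estimates), mean_detected, len(detected)


mean_all, mean_detected, n_detected = simulate_detection_bias()
print("mean estimate, all experiments:", round(mean_all, 3))
print("mean estimate, significant only:", round(mean_detected, 3))
print("experiments declaring significance:", n_detected)
```

Because an estimate must exceed the detection threshold to be reported, the average of the reported estimates necessarily lies above the true effect whenever the threshold itself does, which is the situation for small effects and limited samples.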
| Association between a QTL and a Molecular Marker in Terms of Conditional Probability |
|---|
To express the association between the marker and the gene as a conditional probability, let us define y as a quantitative trait under control of w QTL in a breeding population, such that the phenotype of a pure line i can be described by the simplified statistical model

$$y_i = \mu + \sum_{q=1}^{w} \alpha_{qi} + e_i$$

where $\mu$ is the population mean, $\alpha_{qi}$ is two times the additive effect of the allele carried by the line i at the QTL q (q = 1,...,w), and $e_i$ represents the error associated with the phenotypic evaluation of line i, $e_i \sim N(0, \sigma^2)$. Let L be the QTL under test (q = 1) and $g_i$ be the sum of additive effects of the alleles carried by line i at all other QTL (q = 2,...,w), or "polygenic effect." Then the statistical model can be simplified to

$$y_i = \mu + \alpha_{1i} + g_i + e_i \qquad [1]$$

Assume that L has two alleles, A and a, where allele A has additive effect $\alpha_A/2$ and allele a has additive effect equal to zero ($\alpha_a = 0$).

Now consider the phenotypic values associated with a marker locus. Although the molecular marker is considered functionally neutral, it can affect the expectation of phenotypic value by changing the probabilities of the alleles at L and the expectation of the polygenic effect. The expected value of y can be expressed as a conditional expectation, given the allele M at a molecular marker locus J:

$$E(y|M) = \Pr(A|M)\,E(y|AM) + \Pr(a|M)\,E(y|aM)$$

Since $\alpha_a = 0$, the conditional expectations are $E(y|AM) = \mu + \alpha_A + E(g|M)$ and $E(y|aM) = \mu + E(g|M)$; consequently,

$$E(y|M) = \mu + \Pr(A|M)\,\alpha_A + E(g|M)$$

Let the polygenic effects be random, $g \sim N(0, \sigma^2_g)$, creating the settings for a mixed effects model. Since E(g) = 0,

$$E(g|M) = \frac{N}{n_M}\,\mathrm{cov}(g, I_M)$$

$$E(y|M) = \mu + \Pr(A|M)\,\alpha_A + \frac{N}{n_M}\,\mathrm{cov}(g, I_M) \qquad [2]$$

where $I_M$ is an indicator variable equal to 1 for lines carrying M and 0 otherwise, $n_M$ is the number of lines carrying M, and N is the total number of lines. Equation [2] shows the two factors that separate $\alpha_A$ from the mean of plants carrying the marker allele M; therefore, it is desirable to maximize Pr(A|M) and to minimize $\mathrm{cov}(g, I_M)$. Additionally, Eq. [2] also shows that rare alleles ($N/n_M \gg 2$) are more susceptible to the biases caused by covariances between the marker and the polygenic effects.
| Population Structure Must Be Considered for Valid Association Analysis |
|---|
When the data are analyzed with the simple model $y_i = \mu + \alpha_{1i} + e_i$, the covariance in Eq. [2] is included in the error term. If the marker J is in linkage equilibrium with the other QTL influencing y, the covariance of polygenic effects with the marker allele M is null [$\mathrm{cov}(g, I_M) = 0$] and creates no bias. However, if QTL alleles are arranged in any systematic way, the error term will no longer be identically and independently distributed, contradicting a basic assumption of the analysis of variance. The bias can be avoided to the extent that factors related to the covariances among QTL can be identified and included in the model. Races (e.g., indica or japonica rice) or major breeding pools (e.g., spring or winter wheat) represent strong population structuring and must be recognized in the analysis. Secondary subdivisions or hidden population structure can be inferred through unlinked marker data (Pritchard et al., 2000a).
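The decomposition of a marker-allele mean into a gene term and a polygenic covariance term can be checked numerically on a toy data set. The sketch below assumes error-free phenotypes and polygenic effects that sum to zero; the eight lines, their genotypes, and the function name `marker_mean_decomposition` are invented for illustration.

```python
def marker_mean_decomposition(mu, alpha_A, has_M, has_A, g):
    """Return the observed mean of lines carrying M and the value predicted
    as mu + Pr(A|M) * alpha_A + (N / n_M) * cov(g, I_M).

    has_M, has_A: 0/1 indicators per line; g: polygenic effects (mean zero).
    Assumption: no residual error, so the identity holds exactly.
    """
    N = len(g)
    n_M = sum(has_M)
    # Phenotype of each pure line: mean + QTL effect + polygenic effect.
    y = [mu + alpha_A * a + gi for a, gi in zip(has_A, g)]
    mean_y_M = sum(yi for yi, m in zip(y, has_M) if m) / n_M
    pr_A_given_M = sum(a for a, m in zip(has_A, has_M) if m) / n_M
    g_bar = sum(g) / N
    # Population covariance between polygenic effect and marker indicator.
    cov_g_IM = sum((gi - g_bar) * (m - n_M / N)
                   for gi, m in zip(g, has_M)) / N
    predicted = mu + pr_A_given_M * alpha_A + (N / n_M) * cov_g_IM
    return mean_y_M, predicted, pr_A_given_M, cov_g_IM


# Hypothetical population: M is associated with A and with positive g.
has_M = [1, 1, 1, 1, 0, 0, 0, 0]
has_A = [1, 1, 1, 0, 0, 0, 0, 1]
g = [2, 1, 1, 0, -1, -1, -2, 0]
observed, predicted, pr_A_M, cov = marker_mean_decomposition(
    10.0, 4.0, has_M, has_A, g)
print(observed, predicted, pr_A_M, cov)
```

In this toy case the mean of M carriers exceeds the population mean by more than Pr(A|M) times the allele effect; the excess is the polygenic covariance term, which is what population structure would inflate.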
Inclusion of population subdivision as random effects in a mixed model allows for the computation of unbiased estimates of allele effects. Considering a simplified case where individuals are discretely assigned to one of k subpopulations, without admixture, the variance-covariance matrix V(Y) can be represented as a group of submatrices (Littell et al., 1996):

$$V(Y) = \begin{bmatrix} V_1 & 0 & \cdots & 0 \\ 0 & V_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & V_k \end{bmatrix}, \qquad V_j = \begin{bmatrix} \sigma^2_y & \sigma^2_s & \cdots & \sigma^2_s \\ \sigma^2_s & \sigma^2_y & \cdots & \sigma^2_s \\ \vdots & \vdots & \ddots & \vdots \\ \sigma^2_s & \sigma^2_s & \cdots & \sigma^2_y \end{bmatrix}$$

where $\sigma^2_s$ is the variance among subpopulations, $\sigma^2$ is the residual variance, and $\sigma^2_y = \sigma^2_s + \sigma^2$. Hence, the covariance between two lines would be $\sigma^2_s$ if they belong to the same subpopulation, or zero if they belong to different subpopulations. Comparisons between subpopulations have variance $2(\sigma^2_s + \sigma^2)$, whereas comparisons within subpopulations have variance $2\sigma^2$, which compensates for the inflating effect of covariances of polygenic effects with marker alleles, restoring the validity of the hypothesis test (Kennedy et al., 1992). More complex models can accommodate different levels of relationship and admixture of subpopulations (Yu et al., 2006). On the other hand, a gene that is polymorphic among subpopulations, but nearly monomorphic within subpopulations, will have its effect confounded with polygenic effects and is unlikely to be detected by AA (Deng, 2001).
Population stratification can be controlled in different levels of detail, depending on the desired level of confidence. As the population is divided in more subgroups, the probability of false positives is reduced, at the cost of a reduction in statistical power (Cardon and Palmer, 2003). The proportion of residual variance that is captured by population structure can be quantified in the mixed model as the intraclass correlation coefficient, $icc = \sigma^2_s/(\sigma^2_s + \sigma^2)$ (Neter et al., 1996). A high icc indicates that a large proportion of the variance of the trait is observed between subpopulations.
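The block structure of the variance-covariance matrix, the variances of within- and between-subpopulation comparisons, and the intraclass correlation can be sketched in a few lines. The variance components and subpopulation labels below are arbitrary illustrative values, and the function names are hypothetical.

```python
def variance_covariance(subpop, var_s, var_e):
    """Build V(Y) for lines discretely assigned to subpopulations.

    Lines in the same subpopulation share covariance var_s; every line
    has variance var_s + var_e; lines in different subpopulations have
    zero covariance.
    """
    n = len(subpop)
    return [[(var_s if subpop[i] == subpop[j] else 0.0)
             + (var_e if i == j else 0.0)
             for j in range(n)]
            for i in range(n)]


def variance_of_difference(V, i, j):
    """Variance of the comparison y_i - y_j implied by V."""
    return V[i][i] + V[j][j] - 2.0 * V[i][j]


def intraclass_correlation(var_s, var_e):
    """Proportion of variance observed between subpopulations."""
    return var_s / (var_s + var_e)


subpop = [0, 0, 1, 1, 1]          # two subpopulations, five lines
V = variance_covariance(subpop, var_s=3.0, var_e=1.0)
print("within :", variance_of_difference(V, 0, 1))   # equals 2 * var_e
print("between:", variance_of_difference(V, 0, 2))   # equals 2 * (var_s + var_e)
print("icc    :", intraclass_correlation(3.0, 1.0))
```

With these values a within-subpopulation comparison is four times more precise than a between-subpopulation one, which is why markers that vary mostly between subpopulations are hard to detect.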
| Choice of Populations for Association Analysis in Plant Breeding Programs |
|---|
Core collections are useful materials for AA of qualitative traits, such as disease resistance or special quality characteristics (color, aroma, etc.). Studies focusing on domestication-related traits such as seed dormancy, shattering, or inflorescence type also could require wide phenotypic variation, beyond the limits of cultivated germplasm (Clark et al., 2004). Conversely, the broad genetic variability of those collections normally makes them unsuitable for analysis of quantitative traits because part of the accessions would be unadapted to growing conditions and prevalent diseases, resulting in poor precision of trait measurement.
Common ancestors of distantly related individuals occurred many generations ago; therefore, LD is expected to have decayed to short genetic distances. For this reason, AA in core collections will probably require candidate genes or major QTL mapped within narrow confidence intervals (Thornsberry et al., 2001). Compared with linkage-based fine mapping and positional cloning (Yan et al., 2003), the AA approach would offer the advantage of simultaneously detecting the effect and screening the germplasm for useful alleles. Significant markers would be useful for introgression of the new variation into elite germplasm through marker-assisted backcrossing (Frisch and Melchinger, 2005), while markers used for population structure inference could be used to speed up the recovery of the recurrent parent genome. Theoretical projections indicate that the use of two markers per chromosome for selection against the donor genotype could shorten the transfer by about two generations (Hospital et al., 1992).
Elite Lines and Cultivars
Maximum relative efficiency of marker-assisted selection compared with phenotypic selection is expected when heritability is low and markers capture a significant portion of the variation for the trait (Lande and Thompson, 1990). Elite lines are desirable materials for AA of low heritability traits, including yield, yield components, and tolerance to abiotic stresses because elite lines are genetically stable and are well adapted to normal growing conditions.
In plant breeding programs, there is normally a large body of phenotypic data accumulated for elite lines and cultivars from replicated field experiments over locations and years. Use of those data for AA requires statistical models accounting for covariances introduced both by experimental design (years, locations, replicates) and polygenic effects. Moreover, those data are often unbalanced because new lines are included in field trials each year, while other lines are discarded. Maximum likelihood solutions of mixed-effects models yield minimum-variance unbiased estimates of allele effects from unbalanced data, taking into account the correlation structure of the data (Pinheiro and Bates, 2000). Mixed-effects models were used to analyze plant height, disease resistance, and grain moisture in maize (Parisseaux and Bernardo, 2004) and grain size and milling quality in wheat (Breseghello and Sorrells, 2005).
Population structure can be prominent in elite material because it is common for closely related lines to be admitted to advanced trials. If pedigrees are known, the relationships among the lines can be determined (Bered et al., 2002) and used to control for polygenic effects (Zhang et al., 2005). In this case, it is not essential to estimate population structure through unlinked markers, although there may still be interest in marker data as a genetic fingerprint for variety protection (Röder et al., 2002) and for purity control of seed production.
A typical elite plant breeding pool is derived from few founders in the recent past, and is submitted to intense selection. For those reasons, LD is expected to be high in this material, and the first experimental results confirm this expectation (Ching et al., 2002; Tenaillon, 2001). Although AA in elite lines may not offer much improved resolution compared with QTL analysis in biparental mapping populations, there are at least two important advantages: a substantially higher level of polymorphism and detection of favorable alleles directly in the target population.
Elite lines are natural candidates for crossing to generate the next round of breeding, and significant markers could be used for marker-assisted selection in the progeny. However, the breeder needs to confirm whether a given pair of parents differing for the marker indeed differs for the gene, before using the marker as a proxy for selection (Table 2). With a less-than-perfect association between M and A, some lines carrying M may have a, whereas some lines with m may have A. In this way, although a cross M × m for the marker is more likely to be A × a for the gene, it can also be A × A, a × a, or even a × A. Validation could be achieved by demonstrating association between F2 genotypes and F3 phenotypes for the quantitative trait. This test would have high statistical power because the design is balanced, no population structure is expected, and no multiple testing is involved.
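The chance that a cross between an M parent and an m parent actually segregates for the gene can be sketched from the two conditional probabilities. The calculation assumes that the gene alleles of the two parents are independent given their marker alleles; the probabilities used in the example and the function name `cross_probabilities` are hypothetical.

```python
def cross_probabilities(pr_A_given_M, pr_A_given_m):
    """Probability of each gene-level cross type for a marker cross M x m.

    Assumption: given their marker alleles, the two parents carry gene
    alleles independently of each other.
    """
    p = pr_A_given_M   # chance that the M parent carries A
    q = pr_A_given_m   # chance that the m parent carries A
    return {
        "A x a": p * (1.0 - q),          # marker phase matches gene phase
        "a x A": (1.0 - p) * q,          # gene phase is reversed
        "A x A": p * q,                  # no segregation at the gene
        "a x a": (1.0 - p) * (1.0 - q),  # no segregation at the gene
    }


probs = cross_probabilities(pr_A_given_M=0.9, pr_A_given_m=0.2)
for cross, value in probs.items():
    print(cross, round(value, 3))
print("segregating at the gene:", round(probs["A x a"] + probs["a x A"], 3))
```

Even with a fairly strong association, a sizeable share of crosses would not segregate at the gene in the expected phase, which is the reason for validating each cross before relying on the marker.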
Genotypic information could be useful in all phases of population breeding. In the choice of parents to form the population, knowledge of the genetic distance among lines would be useful to achieve a compromise between high means for agronomic traits and high allelic variability. By genotyping samples of subsequent cycles with unlinked markers, breeders can monitor changes in allele diversity, effective population size, and population structure (Courtois et al., 2005; Ramis et al., 2005).
The allele diversity of synthetic populations depends on the number and divergence of parents and the intensity of selection applied. Genetic diversity can be expressed, among other measures, as the effective allele number, $A_e = 1/\sum p_i^2$, where $p_i$ is the frequency of allele i (Hartl and Clark, 1997). An approximate effective population size can be derived from estimates of LD ($r^2$) among unlinked markers, as $N_e = 1/(2r^2)$ (Hedrick, 2005). Reduced effective population size can cause genetic drift. Conversely, changes in allele frequency beyond those expected from genetic drift for a given population size indicate genomic regions that were probably affected by phenotypic selection (De Koeyer et al., 2001; Labate et al., 1999).
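Both monitoring statistics are simple to compute from marker data. The sketch below uses made-up allele frequencies and r² values; averaging r² over pairs of unlinked markers before applying the formula is an assumption of this illustration, and no correction for sample size is applied.

```python
def effective_allele_number(freqs):
    """Effective allele number: 1 / sum(p_i^2) over the alleles at a locus."""
    return 1.0 / sum(p * p for p in freqs)


def effective_population_size(r2_values):
    """Approximate Ne = 1 / (2 * r2), using the mean r2 among pairs of
    unlinked markers (assumption: no sample-size correction)."""
    mean_r2 = sum(r2_values) / len(r2_values)
    return 1.0 / (2.0 * mean_r2)


print(effective_allele_number([0.5, 0.5]))          # two equally frequent alleles
print(effective_allele_number([0.7, 0.2, 0.1]))     # three alleles, uneven
print(effective_population_size([0.008, 0.012, 0.010]))
```

Uneven allele frequencies pull the effective allele number below the count of alleles, so a drop in this value across cycles signals loss of diversity even before alleles disappear.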
The level of LD in synthetic populations is expected to be high in the initial generations, such that a genome scan could detect large chromosome segments associated with traits, and trace them back to parental haplotypes. In subsequent generations, the decay of LD by recombination would favor increasingly refined mapping. However, synthetic populations are often submitted to recurrent selection, a breeding scheme consisting of successive cycles of evaluation, selection, and recombination (Fehr, 1987). Intense selection could build up LD by favoring allelic combinations or by promoting genetic drift (Palaisa et al., 2003). For this reason, populations subjected to mild or no selection would be preferred for AA. Laurie et al. (2004) developed a population for association analysis from the Illinois high/low oil populations, with 10 generations of recombination without selection.
In pedigree breeding, significant markers have to be confirmed in each cross, while in populational breeding, they can be included in selection indices, along with phenotypic information (Lande and Thompson, 1990), on the basis of their probabilistic association with the trait. The relative weight attributed to phenotypes and genotypes in the selection index could fluctuate according to the quality of the phenotypic evaluation in each cycle. When traits can be evaluated with precision, selection could be done on the basis of phenotypes, while associations with markers would be established. In cycles when field experiments fail to give precise data, selection could depend more heavily on genotypic data. This scheme represents a carry-on of information from a "good year" to a "bad year" through genetic markers. Selection based exclusively on marker data has been referred to as "genotype building" (Dekkers and Hospital, 2002), and it has been demonstrated by simulation that it could give genetic gains for a few generations following phenotyping, even if the linkage between genes and markers is not very tight (Hospital et al., 1997).
AA in synthetic populations under selection will require intensive genotyping because in each cycle, new progenies have to be tested to reflect the current state of the population and for implementation of marker-assisted selection. On the other hand, information about a population is cumulative over years, allowing a progressively refined genetic analysis of traits of interest to the breeding program.
A Genetic Model for Estimation of Pr(A|M) in Synthetic Populations under Recurrent Selection
We demonstrated that the association between marker and gene can be expressed as conditional probabilities and that synthetics are especially useful for AA in plant breeding. In the context of synthetic populations under recurrent selection, the conditional probability of the gene allele A, given the marker allele M, can be computed on the basis of the history of the population. This model assumes no epistasis, no genetic drift, and constant relative fitness coefficients.
Suppose that a parental line P was used in the synthesis of the population, contributing the exclusive allele A at the QTL L, and the allele M, not necessarily exclusive, at the molecular marker locus J. Let c represent the recombination frequency between L and J, and suppose that P contributed a proportion $\rho$ of the genetic base of the population. Additionally, let $\theta$ represent the frequency of the marker allele M which was not contributed by P. Under those settings, in the initial generation, the genotypic frequencies are $\Pr(AM|t_0) = \rho$, $\Pr(aM|t_0) = \theta$, $\Pr(am|t_0) = 1 - \rho - \theta$, and $\Pr(Am|t_0) = 0$, where a and m represent alternative alleles at L and J, respectively (considering two biallelic loci). The frequencies of the gene and marker alleles of interest are $\Pr(A|t_0) = \rho$ and $\Pr(M|t_0) = \rho + \theta$, and the linkage disequilibrium between L and J is $D_{t_0} = \rho(1 - \rho - \theta)$.
Considering a recurrent selection scheme, phenotypic selection can be made for A, marker-based selection can be made for M, or selection can be imposed on a combination of both. The two-locus fitness, without epistasis, can be estimated by the multiplication of the relative fitness of each locus (Hedrick, 2005), such that $w_{AM.AM} = w_{AA} \times w_{MM}$, with $0 \le w \le 1$. Additionally, assume no linkage phase effect, such that $w_{AM.am} = w_{Am.aM} = w_h$. In the plant breeding context, relative fitness is proportional to the chance of the individual being selected by the breeder.
Applying standard population genetics theory (Hedrick, 2005, p.560), the expected frequencies of gametes carrying each combination marker-gene in the next generation (t + 1) can be computed on the basis of selection and recombination:
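For orientation, the textbook two-locus recursion with selection and recombination has the general form sketched below. It is written under the assumptions stated in the text (multiplicative fitness across loci, no phase effect) plus random mating among selected individuals. The symbols $x_1,\dots,x_4$, $w_{ij}$, $w_i$, and $\bar{w}$ are notation introduced for this sketch and are not necessarily the expressions the authors used.

```latex
% Gamete frequencies in generation t (notation assumed for this sketch):
%   x_1 = Pr(AM|t), x_2 = Pr(Am|t), x_3 = Pr(aM|t), x_4 = Pr(am|t)
% w_{ij}: fitness of the genotype formed by gametes i and j
\begin{aligned}
D_t      &= x_1 x_4 - x_2 x_3, \qquad
w_i       = \sum_{j=1}^{4} x_j\, w_{ij}, \qquad
\bar{w}   = \sum_{i=1}^{4} x_i\, w_i \\[4pt]
x_1' &= \frac{x_1 w_1 - c\, w_h D_t}{\bar{w}}, \qquad
x_2'  = \frac{x_2 w_2 + c\, w_h D_t}{\bar{w}} \\[4pt]
x_3' &= \frac{x_3 w_3 + c\, w_h D_t}{\bar{w}}, \qquad
x_4'  = \frac{x_4 w_4 - c\, w_h D_t}{\bar{w}} \\[4pt]
\Pr(A \mid M, t+1) &= \frac{x_1'}{x_1' + x_3'}
\end{aligned}
```

Without selection every $w_{ij}$ equals 1, so allele frequencies stay constant and the recursion reduces to $D_{t+1} = (1 - c)\,D_t$, the familiar decay of LD by recombination.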
The Conditional Probability Pr(A|M) Is Maximum for Exclusive Marker Alleles
Using this model, the breeder can better understand the expected association between two loci in actual populations, on the basis of the tightness of linkage between them, the knowledge of the genetic base of the population, number of generations since its synthesis, and the intensity of phenotypic selection applied. Nevertheless, it should be clear that a large variance may be observed in real situations.
The conditional probability Pr(A|M) is a measure of LD between the marker and the gene. Markers at which the allele present in the parent P is exclusive in the population have maximum initial LD [$\Pr(A|M, t_0) = 1$]. In contrast, nonexclusive marker alleles (for which $\theta$, the frequency of M not contributed by P, is greater than zero) have a lower starting LD. For example, consider a synthetic population formed by intercrossing 20 pure lines in equal frequency, in which the parent P contributed the exclusive gene allele A at the QTL L, located 1 cM from the SSR locus J, and 5 cM from the SSR locus K. If the allele carried by P at J was present in one of the 19 other parents used in the synthesis of the population ($\theta = 0.05$), whereas the allele at K was exclusive ($\theta = 0$), the allele at K is a better predictor of A than the allele at J, until the 17th cycle of recombination (Fig. 1). For a number of generations, $\theta$ is the major factor defining conditional probabilities.
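The comparison between the two SSR loci can be sketched for the simplest case of random mating without selection, where LD decays by a factor (1 - c) per generation. In this sketch `rho` stands for the proportion of the genetic base contributed by P and `theta` for the frequency of M not contributed by P. Treating 1 cM as c = 0.01 and 5 cM as c = 0.05 is an approximation, and the exact generation at which the two curves cross depends on such conventions.

```python
def pr_A_given_M(rho, theta, c, t):
    """Pr(A|M) after t generations of random mating without selection.

    rho:   proportion of the genetic base contributed by parent P
    theta: frequency of marker allele M not contributed by P
    c:     recombination frequency between gene and marker
    Assumption: allele frequencies stay constant (no selection, no drift).
    """
    pr_A = rho
    pr_M = rho + theta
    d0 = rho * (1.0 - rho - theta)       # initial linkage disequilibrium
    d_t = d0 * (1.0 - c) ** t            # decay by recombination
    return (pr_A * pr_M + d_t) / pr_M    # Pr(AM) / Pr(M)


rho = 1.0 / 20.0                         # one of 20 parents in equal frequency
for t in (0, 5, 10, 15, 20, 25, 30):
    at_J = pr_A_given_M(rho, theta=0.05, c=0.01, t=t)   # nonexclusive, 1 cM
    at_K = pr_A_given_M(rho, theta=0.0, c=0.05, t=t)    # exclusive, 5 cM
    print(t, round(at_J, 3), round(at_K, 3))
```

The exclusive allele at K starts from a conditional probability of 1 but loses it faster, while the nonexclusive allele at J starts at 0.5 and declines slowly, so the tighter marker only becomes the better predictor after a number of cycles.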
Use of Significant Markers for Marker-Assisted Selection
Once a genetic marker has been demonstrated to be associated with a phenotypic trait of interest, it can be used as a selection target to obtain an indirect response in the trait. In recurrent selection, markers could be used to store information acquired from phenotypic evaluations, which can be used for selection in later cycles. Likewise, in pedigree breeding, markers could carry information about yield potential from the phase of replicated field trials to the phase of single-plant selection, when evaluation of yield cannot be made with reasonable precision. If the linkage between A and M is tight, genetic gain can be accelerated by including M in a selection index that considers several traits and markers simultaneously (Falconer and Mackay, 1996; Lande and Thompson, 1990).
Figure 2 shows changes in Pr(A) and Pr(A|M) caused by phenotypic selection, marker-based selection, and a combination of both in a recurrent selection scheme, for three levels of linkage. In this example, it was considered that each allele substitution reduced the relative fitness of the individual by 0.10 in the case of the gene and by 0.25 in the case of the marker. The higher impact of the marker in the relative fitness (here interpreted as the chance of selection by the breeder), is justified by its higher recognizability compared with a gene underlying a quantitative trait. When the marker is closely linked to the gene (c = 0.01), marker-based selection is approximately as efficient as combined selection. For loose linkage (c = 0.10), combined selection is more efficient than either method alone. In all cases, the use of the marker improved selection efficiency. This advantage must be compared with the additional costs of obtaining genotypic data to evaluate the economic efficacy of marker-assisted selection.
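The three selection strategies can be explored with a small simulation of the two-locus recursion. This is a sketch, not the authors' implementation: it assumes random mating among selected individuals, multiplicative fitness across loci, and starting frequencies of 0.05 for both the AM and aM classes (borrowed from the 20-parent example, since the starting values behind Figure 2 are not given in the text). The fitness reductions of 0.10 and 0.25 per allele substitution are read as fitness values of (0.8, 0.9, 1.0) and (0.5, 0.75, 1.0) for zero, one, and two favorable alleles.

```python
GAMETES = ("AM", "Am", "aM", "am")


def next_generation(x, c, w_gene, w_marker):
    """One cycle of selection and recombination on gamete frequencies x.

    w_gene[k], w_marker[k]: relative fitness with k copies of A or M.
    """
    def fitness(g1, g2):
        n_A = (g1[0] == "A") + (g2[0] == "A")
        n_M = (g1[1] == "M") + (g2[1] == "M")
        return w_gene[n_A] * w_marker[n_M]   # multiplicative, no epistasis

    marginal = {g: sum(x[h] * fitness(g, h) for h in GAMETES) for g in GAMETES}
    w_bar = sum(x[g] * marginal[g] for g in GAMETES)
    d = x["AM"] * x["am"] - x["Am"] * x["aM"]
    w_h = w_gene[1] * w_marker[1]            # double heterozygote, either phase
    sign = {"AM": -1.0, "Am": 1.0, "aM": 1.0, "am": -1.0}
    return {g: (x[g] * marginal[g] + sign[g] * c * w_h * d) / w_bar
            for g in GAMETES}


def simulate(x0, c, w_gene, w_marker, cycles):
    x = dict(x0)
    for _ in range(cycles):
        x = next_generation(x, c, w_gene, w_marker)
    return x


def pr_A(x):
    return x["AM"] + x["Am"]


def pr_A_given_M(x):
    return x["AM"] / (x["AM"] + x["aM"])


NEUTRAL = (1.0, 1.0, 1.0)
GENE = (0.8, 0.9, 1.0)       # 0.10 lost per allele substitution at the gene
MARKER = (0.5, 0.75, 1.0)    # 0.25 lost per allele substitution at the marker
x0 = {"AM": 0.05, "Am": 0.0, "aM": 0.05, "am": 0.90}   # assumed starting point

for label, wg, wm in (("phenotypic", GENE, NEUTRAL),
                      ("marker-based", NEUTRAL, MARKER),
                      ("combined", GENE, MARKER)):
    x = simulate(x0, 0.01, wg, wm, 5)
    print(label, round(pr_A(x), 3), round(pr_A_given_M(x), 3))
```

Changing `c` to 0.10 and the number of cycles shows how quickly the advantage of the marker erodes under loose linkage, which is the comparison the three linkage levels in Figure 2 are meant to convey.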
When Pr(A|M) ≈ Pr(A), the marker can be considered exhausted as a tool for indirect selection. Further gains could be obtained by phenotypic selection or by shifting to another marker in closer association with the gene. The expectation through time is that the causative polymorphism is discovered and selected directly.
Received for publication September 7, 2005.