a Universidad Autónoma Agraria Antonio Narro, Departamento de Fitomejoramiento. Buenavista, Saltillo, Coah. C.P. 25315. Mexico
mhreyes{at}uaaan.mx
| ABSTRACT |
|---|
Abbreviations: BC, backcross; cM, centimorgan; GS, genomic selection; MAS, marker-assisted selection; QTL, quantitative trait locus; RAPD, random amplified polymorphic DNA; RFLP, restriction fragment length polymorphism
| INTRODUCTION |
|---|
Although the backcrossing approach has been successful in many instances, one of the main limitations is the number of generations, and thus time, necessary to achieve the introgression objective. Classically, the expected fraction of genome from the recurrent parent in the bth backcross generation is calculated as $1 - (1/2)^{b+1}$. However, this formula ignores "linkage drag" (Brinkman and Frey, 1977), i.e., the persistence of donor genetic material linked to the gene to be introgressed.
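As a quick numerical illustration of the classical expectation above, the following Python sketch tabulates the recurrent-parent fraction for the first few backcross generations. The function name is hypothetical, and the calculation ignores selection and linkage drag, exactly as the textbook formula does.

```python
def expected_recurrent_fraction(b: int) -> float:
    """Classical expected recurrent-parent fraction in backcross generation b.

    Assumes no selection and no linkage drag; the function name is
    illustrative and not taken from the original work.
    """
    if b < 1:
        raise ValueError("b must be a positive backcross generation number")
    return 1.0 - 0.5 ** (b + 1)


if __name__ == "__main__":
    for b in range(1, 7):
        print(f"BC{b}: {expected_recurrent_fraction(b):.4f}")
```

Each additional backcross halves the remaining donor fraction, which is why the gain per generation shrinks quickly after the first few backcrosses.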
Linkage drag was considered by Hanson (1959). He developed predictors of the length of chromosome segments retained around a locus held heterozygous with backcrossing or selfing. Naveira and Barbadilla (1992) derived the theoretical distributions of the lengths of donor chromosome segments. Another approach, taken by Stam and Zeven (1981), predicts the proportion of donor genome in the resulting generation. The approach of Hanson (1959), which is based on average donor chromosome lengths, ignores the presence of donor chromosome segments in places of the genome that are non-adjacent to the gene to be introgressed.
The advent of DNA markers opens many possibilities for backcross-based introgression. For instance, with markers linked to specific quantitative trait loci (QTLs), it is possible to introgress specific regions of the genome that confer desirable quantitative characteristics to an elite variety (Tanksley et al., 1989; Paterson et al., 1991; Dudley, 1993). In tomato (Lycopersicon esculentum Mill.), lines have been created that contain QTLs from the wild species Lycopersicon hirsutum Hub. & Bonpl. Such lines outperform the original elite variety in yield, soluble solids content, and fruit color (Tanksley and McCouch, 1997). This result was accomplished by the "advanced backcross QTL analysis" developed by Tanksley and Nelson (1996), combined with marker-assisted selection. DNA markers can also be used to select for maximum similarity to the recipient line and minimum similarity to the donor line (Hillel et al., 1990). This approach can help to expedite the recovery of the recurrent parent background, while retaining the gene or genes to be introgressed.
Several authors have contributed to the theory and strategies of marker-assisted selection (MAS). Use of flanking markers tightly linked to the target gene in introgression programs was suggested by Young and Tanksley (1989). Under this scheme, individuals in a segregating backcross population can be scored for the target genotype along with flanking RFLP markers aimed at recovering the background genotype. If the markers are very close to the target gene, selection can be applied to one marker in one backcross generation and to the other marker in the subsequent generation, thus allowing a realistic population size.
Lande and Thompson (1990) derived selection indices to maximize the rate of improvement in quantitative traits under different MAS schemes, by combining the information on molecular genetic polymorphism with data on phenotypic variation. This scheme allows variations in selection intensity. The selection indices they proposed can be applied to sex-limited traits and can use information from relatives. Also, this approach can be applied to marker selection of immatures, in which selection of seedlings, embryos, or juveniles is based on molecular marker loci, followed by conventional phenotypic selection of the surviving adults.
The so-called "genomic selection" (GS), proposed by Hillel et al. (1990), employs as a selection criterion the degree of resemblance between the DNA fingerprint of a candidate and that of the desired or undesired genome. They presented theoretical distributions and variances of the relative percentage of donor genome without considering information about map positions of markers. The GS scheme permits gene introgression using information from DNA fingerprints to maximize the recipient genome and minimize the donor genome. Visscher et al. (1996) criticized the formulas of Hillel et al. (1990) because their model ignores recombination around the marker loci.
Hospital et al. (1992) studied the effects of time, selection intensity, population size, and number and position of selected markers in introgression breeding programs, on the expected proportion of recipient genome. They focused on the case where only one gene of interest from a donor parent is introgressed into a cultivar. They considered recurrent parent markers surrounding the gene of interest and found that rather distant markers better control the gene neighborhood in terms of recovering recurrent genome, unless high selection intensity can be applied. Additionally, they analyzed the use of recurrent parent markers in chromosomes not carrying the introgressed gene and reported that increasing the number of markers to more than three per chromosome is not efficient. A possible limitation of their analytical approach to calculating the expected proportion of recipient genome in non-carrier chromosomes is the assumption of independence among all loci. However, their simulation results are in qualitative accordance with their analytical approach.
Visscher et al. (1996) investigated by simulation the relative gain in a backcross program using only markers, only phenotypes, or an index of markers and phenotypes. They found that markers were efficient in backcross programs for simultaneously introgressing an allele and selecting for the desired background. Marker spacing of 10 to 20 centimorgans (cM) gave an advantage of one to two backcross generations of selection, relative to random or phenotypic selection. In this and all other cases, the cited authors assumed a Poisson distribution of crossovers along the chromosomes; thus they based their calculations on the Haldane (1919) mapping function. Use of other mapping functions and introgression of several genes simultaneously have not been analytically addressed so far; however, the case of several markers and traits for other MAS schemes has been examined by simulation elsewhere, for example by Gimelfarb and Lande (1994).
In this work, I studied both analytically and numerically the outcome of backcrosses with selection on the basis of two types of marker alleles: those linked to the gene(s) to be introgressed ("donor markers") and markers used to recover the background genotype ("recurrent markers"). The objectives were (i) to derive functions for the probability of donor genetic material in a given site of the genome after a given number of backcrosses, and (ii) to derive the expected proportion of donor genome in a given chromosome and the whole genome.
As a starting point, my model considers selection of only "ideal genotypes", that is, those that have the ideal combination of marker alleles. To be realistic, this perspective requires either large population sizes or few markers. However, the model is further extended to the use of different sets of markers in each generation as a way to reduce the population sizes required for the selection process.
The advantage of this model over previously published ones is that it allows any number of markers with any distribution along the genome, thus permitting predictions to be made in programs where several genes are being introgressed. Furthermore, this model accommodates interference, allowing the use of mapping functions other than that proposed by Haldane (1919), which is unrealistic in many cases. For example, the Kosambi (1944) mapping function fits most data fairly well (Crow and Dove, 1990) and it can be used in this model.
| Genetic model |
|---|
The basic predictions in this model relate to the sets of chromosomes coming from the non-recurrent parent to form the selected plants of the bth backcross generation, i.e., they model "selected gametes" produced by the (b-1)th generation.
Let c be the coincidence, (actual double recombinations)/(number expected with no interference), and assume that it approximates $(2r)^k$, where r is the recombination fraction in an interval between two markers, and k is a constant that depends on the mapping function to be used. For instance, the Haldane (1919) mapping function assumes no interference, thus c = 1 and k = 0; the Kosambi (1944) mapping function assumes that c = 2r, thus k = 1.
Let us denote by $\phi(d)$ an inverse mapping function that converts a map distance d, given in morgans, to a recombination fraction r. In the case of the Haldane (1919) mapping function, $\phi(d) = H(d)$ is:
$$H(d) = \tfrac{1}{2}\left(1 - e^{-2d}\right)$$
In the case of the Kosambi (1944) mapping function, $\phi(d) = K(d)$ is:
$$K(d) = \tfrac{1}{2}\tanh(2d)$$
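The two inverse mapping functions named above have well-known closed forms, and a minimal Python sketch makes their behavior easy to compare. The function names are hypothetical; distances are taken in morgans, as in the text.

```python
import math


def haldane(d: float) -> float:
    """Haldane (1919): recombination fraction for map distance d in morgans."""
    return 0.5 * (1.0 - math.exp(-2.0 * d))


def kosambi(d: float) -> float:
    """Kosambi (1944): recombination fraction for map distance d in morgans."""
    return 0.5 * math.tanh(2.0 * d)


if __name__ == "__main__":
    for cm in (5, 10, 20, 40, 80):
        d = cm / 100.0  # centimorgans to morgans
        print(f"{cm:3d} cM  Haldane r = {haldane(d):.4f}  Kosambi r = {kosambi(d):.4f}")
```

For short distances both functions return values close to d itself; for longer intervals the Haldane value falls below the Kosambi value, which is where the choice of k starts to matter.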
Consider three kinds of chromosome landmarks: chromosome ends (telomeres), recurrent markers, and donor markers. From these kinds of landmarks, six kinds of chromosome intervals are possible: end-end, end-recurrent marker, end-donor marker, recurrent marker-recurrent marker, recurrent marker-donor marker, and donor marker-donor marker. The formulas in Table 1 give the probability of donor genome at a given site of the chromosome, and are specific for each kind of interval. The symbols used here are defined as follows: x represents a position on a chromosome in morgans, considering an arbitrary chromosome end as the zero position; r represents the position of a recurrent marker and d is the position of a donor marker (which is linked to the target gene); b and k were defined above. The formula for the probability of donor genome in an end-end interval, i.e., in an unmarked chromosome, is well known in plant breeding textbooks, as is the formula for an end-donor interval (Allard, 1960). The derivations of the remaining formulas are described in Appendix A.
It is noteworthy that the recurrent markers are used only in the first generation of selection because they become fixed by the second generation. To state a function that assigns the probability of having donor genome at Position x in a given chromosome, we first define a function that indicates whether or not x belongs to a given interval of one of the six types listed above. Let S be a chromosome interval; thus, the indicator function is:
$$I_S(x) = \begin{cases} 1 & \text{if } x \in S \\ 0 & \text{otherwise} \end{cases}$$
Let g(x) be a function that gives the probability of a chromosome having donor genome at Position x, with domain (0, L), where L is the length of the chromosome in morgans:
![]() |
![]() |
![]() |
The subscripts of I, $(z_1, z_2)$, are open intervals that represent unmarked chromosome segments bounded by two landmarks. Symbols $e_1$ and $e_2$ are the arbitrary beginning ($e_1 = 0$) and the arbitrary end ($e_2 = L$) of the chromosome; r is the position of a recurrent marker; $r_1$ and $r_2$ are recurrent marker positions with $r_1 < r_2$; d is the position of a donor marker; $d_1$ and $d_2$ are donor marker positions with $d_1 < d_2$. The term $I_{\{d\}}$, added to achieve continuity, means that any x at a donor marker position has a unit probability of having donor genome. Obviously, for the case of an x at a recurrent parent marker, the zero probability does not need to be stated in g(x).
In a way analogous to the work of Stam and Zeven (1981), the expectation of the proportion G of donor genome in the genetic map of a given chromosome of L morgans can be calculated as:
$$E(G) = \frac{1}{L}\int_0^L g(x)\,dx$$
The variance of the proportion G can be written as:
![]() |
So far the model has treated the case of the expected proportion of donor genome in a single chromosome. To extend this calculation to the whole genome, the following formulas can be used (Stam and Zeven, 1981). First, the expected proportion of donor genome in the n chromosomes can be computed as a weighted mean:
![]() |
![]() |
Calculation of the probability of donor genome at a Position x can be done in a straightforward manner with a hand calculator by applying the formulas given in Table 1. Computation of E(G) and $V_G$ requires a computer routine for numerical integration, which is included as a built-in command in several commercial software packages. The programs developed during this work in Mathematica (Wolfram Research, Inc., Champaign, IL) are available free of cost from the author.
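The numerical integration step can be illustrated with a minimal Python sketch; it is a stand-in for, not a reproduction of, the Mathematica programs mentioned above. It assumes the simplest marked case: a single donor marker, no recurrent markers, and the Haldane function, so that both sides of the marker are end-donor intervals. The probability of donor genome at a site whose recombination fraction with the marker is r is taken here to be (1 - r)^b, the textbook end-donor result. All function names and the choice of a composite Simpson rule are illustrative.

```python
import math


def haldane(d):
    """Recombination fraction for a map distance d in morgans (Haldane, 1919)."""
    return 0.5 * (1.0 - math.exp(-2.0 * d))


def g_single_donor(x, donor_pos, b):
    """Probability of donor genome at position x after b backcrosses.

    Assumption: one donor marker selected every generation, no recurrent
    markers, so donor genome persists at x with probability (1 - r)**b.
    """
    r = haldane(abs(x - donor_pos))
    return (1.0 - r) ** b


def simpson(f, lo, hi, n=2000):
    """Composite Simpson rule with an even number n of subintervals."""
    if n % 2:
        n += 1
    h = (hi - lo) / n
    total = f(lo) + f(hi)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * f(lo + i * h)
    return total * h / 3.0


def expected_donor_proportion(length, donor_pos, b):
    """E(G) = (1/L) * integral of g(x) over (0, L), split at the marker."""
    left = simpson(lambda x: g_single_donor(x, donor_pos, b), 0.0, donor_pos)
    right = simpson(lambda x: g_single_donor(x, donor_pos, b), donor_pos, length)
    return (left + right) / length


if __name__ == "__main__":
    for b in (1, 2, 3):
        print(f"BC{b}: E(G) = {expected_donor_proportion(1.0, 0.5, b):.4f}")
```

The integral is split at the marker position because g(x) has a kink there; integrating each smooth piece separately keeps the Simpson rule accurate.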
| Simulation results |
|---|
Some marker combinations used in the simulations (Table 2) are unrealistic because a large number of plants would have to be screened to recover the desired combination; however, they served to test the model in a wide variety of circumstances. In the first case, a chromosome with four recurrent and two donor markers was assumed. In the second case, only one recurrent and one donor marker were considered. In the third case, two recurrent markers flanking one donor marker were considered. In all cases, the observed average proportions and their variances (in the upper row for each backcross in Table 2) were statistically tested against the expected ones (lower row for each backcross) according to the model presented in this paper. The average proportions of donor parent were compared against their theoretical expectations by two-sided z-tests with α = 0.05. The sampling variances of the proportions were tested against the theoretical values generated by the model presented here, by a bootstrap method with 5000 resamplings in each case and α = 0.05. In all situations, no statistically significant differences were detected between the observed average percentages and their variances and the corresponding theoretical expectations. In the first case, which includes six markers, a second backcross generation did not improve the outcome in terms of percentage of donor parent genome.
| Mapping functions |
|---|
For short map distances, e.g., less than 10 cM, most mapping functions have the same behavior in terms of conversion between recombination frequency and genetic distance; however, when genetic length increases, mapping functions show strong divergences. For example, the Haldane (1919) mapping function tends to overestimate genetic distances and underestimate recombination fractions, as compared with other functions. Thus, it is expected that, when the chromosome intervals between markers are long, the estimations of introgression become sensitive to the underlying assumption about interference, which in previous related works has been assumed to be zero.
Predictions of global donor genome proportions were compared for the Haldane (1919) and Kosambi (1944) mapping functions. Thus, for the first case $\phi(d) = H(d)$ (the Haldane mapping function) and k = 0; for the second case $\phi(d) = K(d)$ and k = 1 (Table 3). The comparisons were made for the three marker arrays previously treated and Backcrosses 1 and 2. In the first case, both estimations are fairly similar, which may be due to the closeness of the markers. In the second case, the differences are larger, especially for Backcross 2; this is presumably due to the long distance between the markers and the chromosome ends. In the third case, there is a considerable difference for Backcross 1. As can be seen in the last case (Table 2), the distances between flanking markers and chromosome ends are long. However, there is no consistency in terms of direction of the difference between both estimations.
| Limitations and extensions of the model |
|---|
As an example, for the first marker array (Table 2), there will be approximately 0.596% of ideal genotypes in Backcross 1. This figure is obtained by assuming a Poisson distribution of chiasmata along the chromosome and multiplying the probabilities of recombination or no recombination in each interval [numerically, this is 0.5 × (1 - 0.275) × 0.165 × (1 - 0.165) × 0.165 × (1 - 0.275)]. The first factor is the probability of a gamete with the recurrent marker located at 20 cM. The second factor is the probability of no recombination between the first and the second recurrent marker, i.e., the conditional probability of recurrent marker at 60 cM given recurrent marker at 20 cM, and so on. Thus we need to screen 771 plants to have a probability of 0.99 of recovering at least one ideal genotype [this number is calculated by solving for n in $(1 - 0.00596)^n = 0.01$]. For Backcross 2, one expects to recover 41.8% of ideal genotypes, and the same value applies to the next backcross generations. For the case of the second marker array (Table 2), the expected percentages of ideal plants recovered in Backcrosses 1 and 2 are 6.5 and 50%, respectively. For the third marker array, the values are 1.36 and 50%, respectively.
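The arithmetic in this paragraph can be retraced with a short Python sketch. The interval lengths used below (40, 20, 20, 20, and 40 cM between consecutive markers) are not stated in the text; they are inferred here from the recombination fractions 0.275 and 0.165 under the Haldane function, and the function names are hypothetical.

```python
import math


def haldane(d):
    """Recombination fraction for a map distance d in morgans (Haldane, 1919)."""
    return 0.5 * (1.0 - math.exp(-2.0 * d))


def ideal_fraction(first_marker_prob, intervals):
    """Multiply per-interval probabilities along the chromosome.

    intervals: (distance_in_morgans, needs_recombination) pairs between
    consecutive selected markers. Multiplying the factors assumes
    independent intervals (no interference), as in the text.
    """
    prob = first_marker_prob
    for distance, needs_recombination in intervals:
        r = haldane(distance)
        prob *= r if needs_recombination else (1.0 - r)
    return prob


def plants_needed(p_ideal, confidence=0.99):
    """Smallest n with 1 - (1 - p_ideal)**n >= confidence."""
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p_ideal))


# Assumed BC1 marker order for the first array: recurrent, recurrent,
# donor, donor, recurrent, recurrent.
bc1_intervals = [(0.40, False), (0.20, True), (0.20, False), (0.20, True), (0.40, False)]
p_bc1 = ideal_fraction(0.5, bc1_intervals)
n_bc1 = plants_needed(p_bc1)
print(f"ideal fraction = {p_bc1:.5f}, plants to screen = {n_bc1}")
```

Because the required sample size grows roughly as the inverse of the ideal-genotype fraction, each additional marker selected in the same generation multiplies the screening effort.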
A better strategy, in terms of reducing the number of plants to be screened, is to use one set of recurrent markers in the first generation and a different set in the next generation. For instance, for the third marker array (Table 2), one can use the recurrent marker placed at 60 cM in Backcross 1 and the recurrent marker placed at 100 cM in Backcross 2. In this way, 8.2% of "ideal plants" are expected in Backcross 1 and 8.2% in Backcross 2. Approximately 54 plants would have to be screened in each backcross, totaling 108, to have a probability of 0.99 of recovering at least one "ideal" genotype in each generation. If all the markers were selected in each generation, 337 plants would be needed in Backcross 1 and 7 in Backcross 2, totaling 344, i.e., more than three times the number required when the recurrent markers are alternated. In terms of reduction of donor genome, there is very little toll to pay for alternating the markers: a fraction of 0.222 of donor genome was estimated for Backcross 2, against 0.196 (Table 2) with the more expensive strategy of selecting all markers in each generation.
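The screening effort of the two strategies can be checked in the same way. The sketch below assumes, as the quoted percentages imply under the Haldane function, that the donor marker of the third array lies 20 cM from each flanking recurrent marker; the helper names are hypothetical.

```python
import math


def haldane(d):
    """Recombination fraction for a map distance d in morgans (Haldane, 1919)."""
    return 0.5 * (1.0 - math.exp(-2.0 * d))


def plants_needed(p_ideal, confidence=0.99):
    """Smallest n with 1 - (1 - p_ideal)**n >= confidence."""
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p_ideal))


# Assumed spacing: 20 cM between the donor marker and each recurrent marker.
r = haldane(0.20)

# Alternating strategy: the donor marker plus one recurrent marker per generation.
p_alternating = 0.5 * r
alternating_total = 2 * plants_needed(p_alternating)

# All-marker strategy: both recurrent markers in BC1, only the donor marker in BC2.
p_full_bc1 = 0.5 * r * r
p_full_bc2 = 0.5
full_total = plants_needed(p_full_bc1) + plants_needed(p_full_bc2)

print(f"alternating: {alternating_total} plants, all markers: {full_total} plants")
```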
The model presented here can be applied in a straightforward way to the already mentioned case of different markers in each generation. Suppose that we want to apply marker-assisted selection in two backcross generations, each one with a different set of markers. In this case, the equation to estimate the expected global percentage of donor genome (G) will have the product g(x) g*(x) instead of g(x). The first function g(x) will have the parameters associated with the marker array in the first backcross and b = 1; the second function, g*(x), will have the set of parameters associated with the marker array to be used in the second backcross and b = 1. In general terms, a product of functions will be used, with each factor corresponding to one backcross generation, and fixing b = 1.
| Conclusions |
|---|
This model requires knowledge of the marker positions, and estimation is based on the genetic map rather than the genome itself. Therefore, the global fractions of donor genome are actually fractions of the genetic map; although the genetic map is related to the physical map, the two are not the same. The model does not distinguish regions of the genome that have the same nucleotide sequence in both donor and recurrent parent; thus, what we call donor genome is the genetic material coming from the donor parent by DNA replication.
Use of different mapping functions gives different results, although the differences found in the cases treated here were not great from the practical point of view. However, use of mapping functions that are more reliable than those traditionally used for estimations in marker-based introgression does not introduce further complications to the model.
When phenotypic and marker selection is considered, along with any chosen selection intensity, a simulation-based method may be a more promising predictive tool than the analytical approach used in this work.
| ACKNOWLEDGMENTS |
|---|
Received for publication November 30, 1998.
| Appendix A |
|---|
![]() |
Now, the genetic distance between r and x is |x - r|, and the recombination fraction is $\phi(|x - r|)$, where $\phi(d)$ is a mapping function that converts a genetic distance d to a recombination probability p. The formula in Table 1 is obtained by substituting $\phi(|x - r|)$ for p.
Derivation of the Recurrent-Recurrent Formula
The gamete from the F1 that will contribute to form the BC1 has undergone one meiosis, and the probability of a Position x of the target chromosome coming from the donor parent equals the probability of a double recombination event flanking the Position x within the interval $(r_1, r_2)$. The conditional probability of this event, given that no recombination took place between the two recurrent parent markers, is $(p_1 p_2 c)/(1 - p)$, where $p_1$ and $p_2$ are the recombination fractions between x and $r_1$, and between x and $r_2$, respectively, and p is the recombination fraction between $r_1$ and $r_2$. The factor c is the coincidence, which is assumed to be $(2p)^k$, where k is a constant associated with the mapping function. As in the case of the end-recurrent formula, the probability of donor genome will halve each generation. Thus we have:
$$\Pr(\text{donor genome at } x) = \frac{p_1\, p_2\, (2p)^k}{1 - p}\left(\frac{1}{2}\right)^{b-1}$$
By substitution with a mapping function as in the end-recurrent formula, we obtain the formula in Table 1.
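The recurrent-recurrent expression derived above can be transcribed directly into code. The following Python sketch does so for the Haldane function (k = 0) or the Kosambi function (k = 1); the function names and the example marker positions are hypothetical, and positions are in morgans.

```python
import math


def haldane(d):
    """Recombination fraction for a map distance d in morgans (Haldane, 1919)."""
    return 0.5 * (1.0 - math.exp(-2.0 * d))


def kosambi(d):
    """Recombination fraction for a map distance d in morgans (Kosambi, 1944)."""
    return 0.5 * math.tanh(2.0 * d)


def donor_prob_between_recurrent(x, r1, r2, b, mapping=haldane, k=0):
    """Probability of donor genome at x, with r1 < x < r2.

    Transcribes (p1 * p2 * c) / (1 - p) * (1/2)**(b - 1), where the
    coincidence c is taken as (2p)**k.
    """
    if not r1 < x < r2:
        raise ValueError("x must lie strictly between the two recurrent markers")
    p1 = mapping(x - r1)
    p2 = mapping(r2 - x)
    p = mapping(r2 - r1)
    coincidence = (2.0 * p) ** k
    return p1 * p2 * coincidence / (1.0 - p) * 0.5 ** (b - 1)


if __name__ == "__main__":
    # Hypothetical example: recurrent markers at 20 and 60 cM, site at 40 cM.
    print(donor_prob_between_recurrent(0.40, 0.20, 0.60, b=1))
    print(donor_prob_between_recurrent(0.40, 0.20, 0.60, b=1, mapping=kosambi, k=1))
```

Passing the mapping function and k together keeps the two assumptions about interference consistent with each other.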
Derivation of the Recurrent-Donor Formula
Any gamete coming from F1 with the recurrent parent marker allele and the donor parent marker allele has undergone recombination between the two markers. To have donor parent genome at Position x, between both markers, recombination must have occurred between the recurrent parent marker and x, but not between the donor parent marker and x. The conditional probability of that event is:
![]() |
Derivation of the Donor-Donor Formula
Presence of recurrent parent genome at x requires double recombination, one in the interval donor-x and the other in x-donor. But the selected gamete showed no recombination between the markers. The conditional probability of double recombination given no recombination between the donor parent markers is:
![]() |
![]() |
The formula in Table 1 is obtained by substitution with a mapping function.
Variance of G
Following Stam and Zeven (1981), we define a Bernoulli random variable as follows
![]() |
The variance can be written as
![]() |
The derivation of the last expression can be seen on the paper of Stam and Zeven (1981). We already have a formula for the second term. For the first term we have
![]() |
![]() |
![]() |
![]() |
![]() |
| REFERENCES |
|---|