a Dep. of Biometry, University of Nebraska, Lincoln, NE 68583 USA
b Dep. of Agronomy, University of Nebraska, Lincoln, NE 68583 USA
keskridge1{at}unl.edu
| ABSTRACT |
|---|
Abbreviations: ANOVA, analysis of variance; CNN, 'Cheyenne'; G x E, genotype x environment interaction; RICLs, recombinant inbred chromosome lines; WI, 'Wichita'.
| INTRODUCTION |
|---|
Generally, two strategies have been adopted to determine the number and location of genes on a chromosome using RICL populations, namely, (i) interpreting the phenotypic distribution of the trait and (ii) following the monogenic inheritance of marker genes (e.g., disease resistance genes) linked to the quantitative trait of interest (Law, 1966, 1967; Law et al., 1976; Snape et al., 1985). However, these approaches are restrictive for the determination of gene number and the nature of gene action. The first approach depends entirely on the recognition of discrete classes; because such discontinuities are often not found, it is difficult to estimate the number of genes controlling a quantitative trait. In the second case, there are often not enough gene markers available in wheat for a single chromosome. It is important to overcome this difficulty, since plant breeders using RICLs would benefit from knowing the number of loci (k) that differ between the chromosome substitution line and the parent cultivar and that control a desired quantitative trait.
The method of Wehrhahn and Allard (1965) may be used to estimate the number of segregating loci responsible for differences between a chromosome substitution line and a parent cultivar and to test hypotheses. The Wehrhahn and Allard method has several advantages over other biometrical approaches, such as the Castle-Wright model, since the method can circumvent problems due to transgressive segregation and variation among loci for allelic effects (Lynch and Walsh, 1998). However, the Wehrhahn and Allard method has several limitations when using RICLs. With field data, there is often a considerable chance of incorrectly classifying RICLs as either parental or nonparental types. Wehrhahn and Allard's method does not account for these errors, and the estimate of k can be seriously biased when the probability of misclassification is large. In addition, Wehrhahn and Allard's method is based on the assumption of unlinked loci. However, RICLs differ by only one chromosome, or a part of a chromosome, resulting in a good chance of tightly linked loci, which may seriously bias Wehrhahn and Allard's estimate.
The objective of this study was to describe an approach to estimate and test hypotheses about the number of loci with genes that differ between a parent cultivar and a chromosome substitution line when studying RICL populations. The method explicitly incorporates errors of incorrectly classifying RICLs as either parental or nonparental types, and we consider how the estimates are affected if loci are linked. The method is used to estimate the number of genes controlling yield, yield components (kernels spike$^{-1}$, 1000-kernel weight, spikes m$^{-2}$), grain volume weight, plant height, and anthesis date in wheat.
| Theory and methods |
|---|
The replicated trials are used for determining the number of RICLs that differ in response from the parent cultivar. At any particular locus (with alleles A and a, where a is from the parental cultivar), the probability that a RICL differs from the parent cultivar is q = 1/2 (Wehrhahn and Allard, 1965). This follows since the cross of two purelines results in homozygous disomic lines of which one-half have allele A. Thus the proportion of all lines that are A, that is, differ from the parent cultivar, is q = 1/2. When the loci are unlinked, the expected proportion of lines that differ from the parent cultivar at one or more of the k loci is $1 - (1 - q)^k$ (Wehrhahn and Allard, 1965; Mulitze and Baker, 1985). If there are m lines and r differ from the parent, the estimated proportion is $\hat{p} = r/m$.
The number of loci with genes affecting the difference between the parental and substitution line for the trait of interest may be estimated by setting the observed proportion of RICLs that differ from the parental cultivar equal to $1 - (1 - q)^k$ and solving for k (Wehrhahn and Allard, 1965; Mulitze and Baker, 1985). For example, given m lines with r RICLs that differ from the parent cultivar, and m - r that do not, the proportion of lines that differ from the parent is $\hat{p} = r/m$. Solving the equation $\hat{p} = 1 - (1 - q)^k$ for k with q = 1/2 gives
$$\hat{k} = \frac{\ln(1 - \hat{p})}{\ln(1/2)} \qquad (1)$$
Derived in this way, $\hat{k}$ is the moment estimator of k, but $\hat{k}$ can also be shown to be the maximum likelihood estimator of k (Agresti, 1990; Eskridge and Coyne, 1996). The estimator $\hat{k}$ is based on the assumptions of no epistasis, no linkage, and normal diploid meiosis. Any deviation from these assumptions will probably decrease the precision of the estimates as stated in Mulitze and Baker (1985).
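As a small numerical illustration of Eq. [1], the sketch below computes the uncorrected estimate from the counts r and m. The function name `estimate_k` is hypothetical. The two example counts are taken from the data lines of the SAS program in the Appendix, and their assignment to 1000-kernel weight and anthesis date is inferred from the proportions and estimates reported in the text.

```python
import math


def estimate_k(r, m, q=0.5):
    """Moment estimator of Eq. [1]: solve r/m = 1 - (1 - q)**k for k."""
    if not 0 <= r < m:
        raise ValueError("need 0 <= r < m so that 1 - r/m is positive")
    p_hat = r / m  # observed proportion of lines that differ from the parent
    return math.log(1 - p_hat) / math.log(1 - q)


# Counts from the Appendix data lines (r differing lines out of m = 50).
k_kernel_weight = estimate_k(19, 50)  # row labelled tk
k_anthesis = estimate_k(34, 50)       # row labelled hd
print(round(k_kernel_weight, 2), round(k_anthesis, 2))  # 0.69 1.64
```

These values agree with the uncorrected estimates quoted in the Results for the two traits.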
In the case of RICLs, the assumption of no linkage may not be justified since the lines differ from the parent only by one chromosome or by a part of a chromosome. Loci on the same chromosome may be linked. Linked loci will bias $\hat{k}$, with the direction and magnitude of the bias depending on the form (coupling or repulsion) and strength of the linkage. For example, let $A_i$ (i = 1, ..., k) be an allele of the ith locus from the substitution line that improves the trait compared with the parent, and let $a_i$ be the allele from the parent for the same locus. The probability that all k loci contain the parental alleles is $P(\bigcap_{i=1}^{k} a_i)$. If it can be assumed that all loci that affect differences between the parent and the substitution line are either unlinked or in coupling phase, then for two linked loci, i and j, $P(a_i \cap a_j) > P(a_i)P(a_j)$. Thus,
$$P\left(\bigcap_{i=1}^{k} a_i\right) > \prod_{i=1}^{k} P(a_i) \qquad (2)$$
Now if $P(a_i) = 1/2$, Eq. [2] results in $P(\bigcap a_i) > (1/2)^k$. The estimate $\hat{k}$ is chosen to make $(1/2)^{\hat{k}} = P(\bigcap a_i)$, where $1 - P(\bigcap a_i)$ is estimated from the data using $\hat{p}$. Since with the true k, $(1/2)^k$ is less than $P(\bigcap a_i)$, the estimate $\hat{k}$ will generally be smaller than the true k. That is, $\hat{k}$ is an underestimate of k if two or more loci are in coupling phase and the remaining are independent. Similar reasoning can be used to show that if some pairs of loci are in repulsion phase while the remaining are unlinked, $P(a_i \cap a_j) < P(a_i)P(a_j)$ and $\hat{k}$ will overestimate k.
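The direction of the linkage bias can be checked numerically for the simplest case of two loci. The sketch below is illustrative only: the parameter `d` is a hypothetical departure from independence, so that the joint probability of both parental alleles is 1/4 + d, with d > 0 standing for coupling and d < 0 for repulsion. It returns the value that Eq. [1] would give in a very large sample when the true number of loci is two.

```python
import math


def large_sample_k_two_loci(d):
    """Value of k-hat implied by Eq. [1] when two loci are truly segregating.

    d is a hypothetical association parameter (an assumption of this sketch):
    P(a_1 and a_2) = 1/4 + d, so d = 0 means the two loci are unlinked.
    """
    if not -0.25 < d <= 0.25:
        raise ValueError("d must lie in (-0.25, 0.25]")
    p_all_parental = 0.25 + d
    # Eq. [1] chooses k-hat so that (1/2)**k_hat equals P(all parental alleles).
    return math.log(p_all_parental) / math.log(0.5)


for d in (0.0, 0.1, 0.2, -0.1, -0.2):
    print(d, round(large_sample_k_two_loci(d), 3))
```

With d = 0 the function returns 2, positive d gives values below 2 (underestimation under coupling), and negative d gives values above 2 (overestimation under repulsion).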
The estimate $\hat{k}$ is also based on the assumption of correct classification of the RICLs into parental or nonparental types. In practice, a parental line may be incorrectly classified as a nonparental line (Type I error) or a nonparental line may be incorrectly classified as a parental line (Type II error). Failure to account for these errors will result in biased estimates of k. An unbiased estimate of k is possible only if it is based on an unbiased estimate of the true proportion (P) of RICLs that differ from the parental line (Mulitze and Baker, 1985). In most previous applications, $\hat{p}$ has been used as an estimate of P. With classification errors, $\hat{p}$ is a biased estimate of P and thus using $\hat{p}$ in Eq. [1] will lead to a biased estimate of k. With large classification errors, this bias can be severe.
To obtain an unbiased estimate of P, assume that m RICLs are available for classification where the presence of one or more substitution line alleles ($A_i$) makes the RICL different from the parental cultivar. Suppose that each nonparental RICL has probability $1 - \beta$ of being correctly classified as a nonparental line [i.e., 1 - P(Type II error) = $1 - \beta$], while the probability that a parental RICL (i.e., one that contains only parental alleles, $a_i$) is incorrectly classified as a nonparental line is $\alpha$, where P(Type I error) = $\alpha$. Let y denote the unknown number of nonparental RICLs out of m. If x is the number of the y nonparental RICLs correctly classified as nonparental and w is the number of the m - y parental RICLs that are incorrectly classified as nonparental lines, then the estimated number of nonparental RICLs is r = x + w. Kotz and Johnson (1982) show that r has a binomial distribution with parameters m and $P(1 - \beta) + (1 - P)\alpha$. To obtain an unbiased estimate of P, set r/m (the biased estimate of P) equal to $P(1 - \beta) + (1 - P)\alpha$ and solve for P, which results in the following formula:
$$\hat{P} = \frac{(r/m) - \alpha}{1 - \alpha - \beta} \qquad (3)$$
Since r/m is obtained from the experiment and $\alpha$ is set by the researcher when classifying lines, $\hat{P}$ may be computed given an estimate of $\beta$. Ideally, $\beta$ would be estimated given the sample size, an estimate of the experimental error variance, and the mean difference between the parent cultivar and the nonparental lines. An estimate of k, adjusted for classification errors, could then be obtained by using $\hat{P}$ instead of $\hat{p}$ in Eq. [1].
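The correction in Eq. [3] followed by Eq. [1] can be sketched in a few lines. The function names are hypothetical. The inputs (r = 19, m = 50, alpha = 0.05, beta = 0.049) are the values listed for one trait in the Appendix data lines, where beta is supplied as a known quantity rather than estimated.

```python
import math


def corrected_proportion(r, m, alpha, beta):
    """Eq. [3]: proportion of nonparental lines adjusted for Type I and II errors."""
    d = 1 - alpha - beta
    if d <= 0:
        raise ValueError("1 - alpha - beta must be positive")
    return (r / m - alpha) / d


def k_from_proportion(p):
    """Eq. [1] applied to a proportion p with 0 <= p < 1."""
    return math.log(1 - p) / math.log(0.5)


r, m, alpha, beta = 19, 50, 0.05, 0.049
k_unadjusted = k_from_proportion(r / m)
k_adjusted = k_from_proportion(corrected_proportion(r, m, alpha, beta))
print(round(k_unadjusted, 3), round(k_adjusted, 3))
```

For these inputs the two estimates are roughly 0.69 and 0.66, so the adjustment is small when power is high, as the Results later note.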
In most applications, it will not be possible to estimate β directly since rarely will the geneticist know the mean difference between the parent cultivar and the nonparental lines. This mean difference depends on k, giving rise to a seemingly circular argument: to obtain an error-adjusted estimate of k, one must know β, but to estimate β, one must know k. One solution to this problem is to use an iterative scheme to estimate k. Begin with an initial k(i) at i = 0. Use k(i) to estimate β(i), then using β(i) in Eq. [3] and [1], estimate k(i + 1). Substitute k(i + 1) for k(i) and continue until |k(i + 1) - k(i)| is small.
To estimate k using this iterative scheme, it is necessary to specify how the mean difference between the parent cultivar and the nonparental lines is influenced by k. If loci are unlinked, each with two alleles ($A_i$, $a_i$; i = 1, ..., k) and each $A_i$ allele having the same effect ($\delta$), then it may be shown that the mean difference between the parent cultivar and the nonparental lines is $[2^{k-1}/(2^k - 1)](\mu_s - \mu_p)$, where $\mu_s$ and $\mu_p$ are the means of the substitution line and parent cultivar, respectively (see Appendix). Estimating this mean difference, using it to compute β based on the standard power formula (e.g., see Eq. [5.30] in Steel and Torrie, 1980), substituting this β into Eq. [3] and substituting Eq. [3] into Eq. [1] results in an iterative equation for k:
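$$k^{(i+1)} = \frac{\ln\left[1 - \dfrac{(r/m) - \alpha}{1 - \alpha - \beta^{(i)}}\right]}{\ln(1/2)} \qquad (4)$$
where $\beta^{(i)}$ is the Type II error rate obtained from the power formula at the mean difference $[2^{k^{(i)}-1}/(2^{k^{(i)}} - 1)](\bar{x}_s - \bar{x}_p)$, $\bar{x}_s$ and $\bar{x}_p$ are the observed means of the substitution line and the parent cultivar, $\Phi(\cdot)$ is the standard normal cumulative distribution function used in the power formula, and $\Phi(Z_{1-\alpha}) = 1 - \alpha$.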
Iterative use of Eq. [4] will generally converge to unique k estimates, which do not depend on the initial value of k(0). Here we used k(0) = 1. However, some cases may occur where final estimates depend on the starting values. In such cases, care should be used in interpreting estimates of k.
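The iterative scheme can be sketched as follows. This is a minimal illustration under stated assumptions, not the exact form of Eq. [4]: the power function `assumed_power` is a one-sided normal approximation chosen for the sketch, and `t` is a hypothetical standardized difference between the substitution line and parent means. Only the factor $2^{k-1}/(2^k - 1)$ and the use of Eq. [3] and [1] come from the text.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()


def mean_diff_factor(k):
    """Factor 2**(k-1) / (2**k - 1) that scales (mu_s - mu_p); needs k > 0."""
    return 2 ** (k - 1) / (2 ** k - 1)


def assumed_power(k, t, alpha):
    """ASSUMPTION: one-sided normal approximation to the power, 1 - beta."""
    z = STD_NORMAL.inv_cdf(1 - alpha)
    return STD_NORMAL.cdf(mean_diff_factor(k) * t - z)


def iterate_k(r, m, t, alpha=0.05, k0=1.0, tol=1e-8, max_iter=500):
    """Iterate Eq. [3] and Eq. [1]; returns (k, power, converged)."""
    if k0 <= 0:
        raise ValueError("k0 must be positive")
    k = k0
    power = assumed_power(k, t, alpha)
    for _ in range(max_iter):
        power = assumed_power(k, t, alpha)
        d = power - alpha  # equals 1 - alpha - beta
        if d <= 0:
            raise ValueError("power must exceed alpha")
        p_adj = (r / m - alpha) / d  # Eq. [3]
        # Keep the proportion inside (0, 1), in the spirit of the
        # 0.01 / 0.99 rule described in the text.
        p_adj = min(max(p_adj, 0.01), 0.99)
        k_new = math.log(1 - p_adj) / math.log(0.5)  # Eq. [1]
        if abs(k_new - k) < tol:
            return k_new, power, True
        k = k_new
    return k, power, False


# Hypothetical inputs: a large standardized difference, so power is near 1.
k_final, power_final, converged = iterate_k(34, 50, t=10.0)
print(round(k_final, 3), round(power_final, 3), converged)
```

Running the function from several starting values `k0` is a simple way to check whether the final estimate depends on the initial value.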
Once a final estimate of k has been obtained, weighted least-squares, as described below, may then be used on this new estimator to obtain standard errors and to test hypotheses. Weighted least-squares estimates are not possible if $\hat{P} \le 0$ or $\hat{P} \ge 1$. To avoid such values of $\hat{P}$, $\hat{P}$ is set to 0.01 when $(r/m) \le \alpha$, and $\hat{P}$ is set to 0.99 if $(r/m) \ge 1 - \beta$. See Agresti (1990) for a technical discussion of the effects of adding small constants to obtain weighted least-squares estimates.
Standard errors and hypothesis tests may be based on weighted least-squares (Grizzle et al., 1969; Eskridge and Coyne, 1996). Assume there are s independent groups, where group is any factor (e.g., environment or trait) thought to explain variation among the k values. For the ith group, $\hat{P}_i$ is obtained from Eq. [3] with the final estimate of β based on the final estimate of k from Eq. [4]. Also, let $m_i$ be the number of lines in the ith group (i = 1, ..., s) and $\hat{\mathbf{k}}' = (\hat{k}_1 \ldots \hat{k}_s)$, where $k_i$ is a function of $P_i$ as expressed via Eq. [1] using $\hat{P}_i$. The covariance matrix of $\hat{\mathbf{k}}$, $\mathbf{S}$, is an s x s diagonal matrix with $S_i$ (i = 1, ..., s) values on the diagonal, where $S_i \approx H_i^2 V_i$, $V_i = \mathrm{var}(\hat{P}_i) = (1/m_i)\hat{P}_i(1 - \hat{P}_i)$, and $H_i = dk_i/dP_i$ evaluated at $P_i = \hat{P}_i$. Now define the model $\mathbf{k} = \mathbf{X}\boldsymbol{\beta}$, where $\mathbf{X}$ is an s x u design matrix and $\boldsymbol{\beta}$ is a u x 1 vector of coefficients (not to be confused with the Type II error rate β). Estimated weighted least-squares may be used to estimate $\boldsymbol{\beta}$: $\hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{S}^{-1}\mathbf{X})^{-1}\mathbf{X}'\mathbf{S}^{-1}\hat{\mathbf{k}}$, which has the covariance matrix $(\mathbf{X}'\mathbf{S}^{-1}\mathbf{X})^{-1}$. Estimates of various quantities (e.g., $k_i$, $k_i - k_j$, etc.) may be obtained by appropriately defining a 1 x u vector of constants, $\mathbf{l}'$, and computing $\mathbf{l}'\hat{\boldsymbol{\beta}}$. The standard error of the estimate $\mathbf{l}'\hat{\boldsymbol{\beta}}$ is $[\mathbf{l}'(\mathbf{X}'\mathbf{S}^{-1}\mathbf{X})^{-1}\mathbf{l}]^{1/2}$. Any linear hypothesis that may be stated as $H_0: \mathbf{L}\boldsymbol{\beta} = \mathbf{0}$, where $\mathbf{L}$ is a c x u matrix of rank c with $c \le u$, may be tested with $X^2 = (\mathbf{L}\hat{\boldsymbol{\beta}})'[\mathbf{L}(\mathbf{X}'\mathbf{S}^{-1}\mathbf{X})^{-1}\mathbf{L}']^{-1}(\mathbf{L}\hat{\boldsymbol{\beta}})$, which is asymptotically chi-square with c degrees of freedom when $H_0$ is true. The GENMOD procedure in SAS may be used to obtain weighted least-squares estimates and standard errors using appropriately defined link and inverse link functions, given in the FWDLINK and INVLINK statements (see Appendix) (SAS Institute, 1997). In this application, trait is used as the group factor to simplify programming.
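For a single group with one coefficient per group (X equal to the identity), the weighted least-squares quantities reduce to a delta-method standard error and a one-degree-of-freedom chi-square test. The sketch below illustrates that special case only. The function names are hypothetical, and the example uses the uncorrected proportion 34/50 from the Appendix data purely for illustration, so its standard error need not match the value reported for the corrected estimate.

```python
import math
from statistics import NormalDist


def k_and_standard_error(p_hat, m):
    """k from Eq. [1] and its delta-method standard error, sqrt(H**2 * V)."""
    k = math.log(1 - p_hat) / math.log(0.5)
    h = 1 / ((1 - p_hat) * math.log(2))  # H = dk/dP evaluated at p_hat
    v = p_hat * (1 - p_hat) / m          # V = var(P-hat)
    return k, math.sqrt(h * h * v)


def wald_chi_square(k_hat, se, k_null):
    """Chi-square statistic (1 df) and p-value for H0: k = k_null."""
    x2 = ((k_hat - k_null) / se) ** 2
    # With 1 df, the chi-square tail equals a two-sided normal tail.
    p_value = 2 * (1 - NormalDist().cdf(math.sqrt(x2)))
    return x2, p_value


k_hat, se = k_and_standard_error(34 / 50, 50)
x2, p_value = wald_chi_square(k_hat, se, k_null=1.0)
print(round(k_hat, 2), round(se, 3), round(x2, 2), round(p_value, 3))
```

With several groups and a general design matrix, the same ingredients (the $S_i$ values) would enter the matrix formulas given above.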
To demonstrate the approach, data were used from a study on the inheritance of yield, kernels spike$^{-1}$, 1000-kernel weight, spikes m$^{-2}$, grain volume weight, plant height, and anthesis date, using a population of RICLs for chromosome 3A of hexaploid wheat (Shah et al., 1999a). Chromosome 3A of winter wheat 'Wichita' (WI) differs from that of 'Cheyenne' (CNN) for a number of important quantitative traits (Berke et al., 1992). Fifty recombinant inbred chromosome lines for chromosome 3A were obtained from a cross between a hard red winter wheat, CNN, and a chromosome substitution line, CNN(WI3A), in which chromosome 3A of WI was substituted for its homologue in CNN. In the F1, the only effective recombination occurs between the WI3A and CNN3A chromosomes, as all the other chromosomes should be from CNN. The resulting crossover products were isolated by crossing the F1 (as male) to the parent cultivar, CNN monosomic for chromosome 3A, as female, and selecting the monosomic progeny (recombinant monosomic lines). Upon selfing the recombinant monosomic lines and selecting the disomic progeny, 50 homozygous RICLs were developed in the CNN background. The selection of monosomic or disomic lines was carried out by cytological examination of root-tip cells. These 50 RICLs were grown in replicated field trials using a randomized complete block design during 3 yr (1994-1996) in four to nine diverse environments (Shah et al., 1999a). Single degree of freedom contrasts were tested, using the genotype x environment (G x E) mean square as the error variance, to identify which of the 50 lines (RICLs-3A) differed significantly (P < 0.05) from the parental cultivar (CNN) for each of the seven agronomic traits.
Using the number of lines (RICLs-3A), out of 50, that significantly differed from the parent cultivar (CNN), the weighted least-squares approach was used to estimate the number of loci with gene(s) affecting the difference between the parental cultivar CNN and the chromosome substitution line CNN(WI3A) for these traits. Two sets of estimates were computed: (i) ignoring classification errors and (ii) correcting for classification errors.
| Results and discussion |
|---|
The uncorrected and corrected estimates ($\hat{k}$ and $\hat{k}_{adj}$) indicated that a single locus (or possibly a group of tightly linked genes) was segregating for 1000-kernel weight and plant height, while two loci were segregating for anthesis date (Table 1). Neither the corrected nor the uncorrected estimates indicated differences between CNN and CNN(WI3A) in the number of genes for grain yield, kernels spike$^{-1}$, spikes m$^{-2}$, and grain volume weight. These traits are known to have large G x E interaction effects, which probably obscured the genetic effects for the trait of interest, making it difficult to identify differences between the RICLs and CNN.
A smaller corrected estimate, compared with the uncorrected, would mean more of an adjustment for α than for β, while a larger corrected estimate, compared with the uncorrected, would mean more of an adjustment for β. If the power had been very poor (1 - β < 0.20), then $\hat{k}_{adj}$ could have been substantially larger than the uncorrected estimate $\hat{k}$. However, with these traits, the power was greater than 0.95, so the correction for misclassification was small. The correction would probably be large in trials where the substitution line differed from the parent but there was only a small chance of correctly identifying a RICL as nonparental. Trials such as these have genetic differences between the parental and substitution lines, but because of misclassification, the unadjusted estimate would be an underestimate of the number of genes. However, if there is very little segregation among the RICLs, the adjustment will be minimal even if power is extremely poor.
The sensitivity of the final k (Eq. [4]) to the initial k(0) is an important consideration when using this procedure. For all traits, except 1000-kernel weight, the final adjusted estimates were unaffected by the initial k(0) in the range of 0 to 6. However, for 1000-kernel weight, the final k converged to $\hat{k}_{adj}$ = 0.658 when k(0) < 2, whereas it converged to $\hat{k}_{adj}$ = 6.64 when k(0) was 2 or larger. Both estimates were reasonable given the data. For $\hat{k}_{adj}$ = 0.658, the final 1 - β estimate was 0.951, indicating a nonparental line would probably be correctly classified as nonparental. Given that 38% of the lines were classified as nonparental, the single gene model was reasonable. Alternatively, when $\hat{k}_{adj}$ = 6.64 (a large number of genes for our method), the 1 - β estimate was 0.33, indicating a poor chance that a nonparental line would be correctly classified. With k large, most of the lines would be nonparental types, but only about one-third would be correctly classified as nonparental. Thus, 38% of the RICLs being classified as nonparental was consistent with a large number of genes given a large β. With these data, the final k estimate was affected by k(0) when β was quite sensitive to k ($t \approx 2.5$ in Eq. [4]) and when $\hat{p}$ was between 0.3 and 0.4. In general, final k estimates should be obtained for several values of k(0) and care should be used in interpretation when the final k estimate is sensitive to the initial k(0) value.
Both the corrected and uncorrected estimates were similar to those found by Shah et al. (1999b) ($\hat{k}_s$) using RFLP markers with a Bonferroni correction to limit the experimentwise error rate (Table 1). With the exception of 1000-kernel weight when k(0) $\ge$ 2, both $\hat{k}$ and $\hat{k}_{adj}$ were within one gene of $\hat{k}_s$ (Table 1), and for five of the seven traits, our estimates were the same or smaller than $\hat{k}_s$ based on Shah et al. (1999b). Our estimates are based on the assumption of no linkage between loci. At the present state of knowledge, there is some debate about the likelihood of linkage in this particular application. However, if some pairs of loci are in coupling phase and the others are unlinked, then both $\hat{k}$ and $\hat{k}_{adj}$ will underestimate the true number of loci. Coupling phase linkage may be a reasonable assumption with these data since all positive alleles were contributed by WI at the loci detected in Shah et al. (1999b). It is also important to recognize that even if all genes are unlinked, the type of variation observed in the RICLs could also be explained by a genetic model based on a very large number of loci (Mulitze and Baker, 1985). This result adds further justification to considering $\hat{k}$ and $\hat{k}_{adj}$ to be lower bounds of the estimated number of loci with genes affecting the difference between the two parental lines.
For anthesis date and 1000-kernel weight, both $\hat{k}$ and $\hat{k}_{adj}$ estimates were larger than $\hat{k}_s$ based on Shah et al. (1999a, 1999b) (anthesis date: $\hat{k}_s$ = 1, $\hat{k}$ = 1.64, $\hat{k}_{adj}$ = 1.61; 1000-kernel weight: $\hat{k}_s$ = 0, $\hat{k}$ = 0.69, $\hat{k}_{adj}$ = 0.66). For anthesis date, our estimates did not strongly contradict those of Shah et al. (1999a, 1999b) since the standard error for anthesis date was large (0.31) and the evidence against the single-gene hypothesis was not strong (P > 0.01). The Bonferroni correction for $\hat{k}_s$ may have been too stringent for 1000-kernel weight: the data indicated a genetic difference between CNN and CNN(WI3A), because the mean 1000-kernel weight differed between CNN and CNN(WI3A) (P < 0.02) and 38% of the RICLs differed from CNN, but $\hat{k}_s$ was 0.
Compared with some commonly used methods based on molecular markers, there are some clear advantages to using $\hat{k}_{adj}$ (or $\hat{k}$) in estimating the number of segregating loci that differ between wheat cultivars using RICLs. Only field data are required to obtain $\hat{k}_{adj}$ (or $\hat{k}$), giving substantial cost savings over molecular marker estimates, which require genomic DNA analyses. In addition, since field data must be used for both molecular marker estimates and $\hat{k}_{adj}$ estimates (or $\hat{k}$), there can be large statistical errors associated with classifying the RICLs. Use of $\hat{k}_{adj}$ explicitly accounts for both Type I and Type II statistical errors of misclassification of the RICLs via Eq. [3]. Some of the most commonly used approaches to analyzing molecular markers do not adequately account for misclassification errors, resulting in questionable estimates of gene numbers. One common method is to conduct a single factor analysis of variance (ANOVA) on the RICLs' trait means with the molecular marker (present or absent) as the classification variable. These tests are then conducted for each marker separately to identify markers that are significantly associated with the trait. The error variance used in these ANOVAs contains among-RICL variance not associated with the marker of interest, with the likely consequence of overestimating error variance, which may result in failure to detect important markers. In addition, this procedure does not include an initial test of significance among RICLs. If this among-RICL variation is nonsignificant, then using multiple ANOVAs to identify significant markers, without somehow controlling overall experimentwise Type I error, may result in a substantial overestimate of the number of markers related to the trait of interest. Thus, an inflated error variance coupled with an uncontrolled experimentwise Type I error results in unknown levels of actual Type I and Type II errors. Type I errors appeared to have more of an impact on gene number estimates in Shah et al. (1999b) (Table 1), assuming $\hat{k}_{adj}$ (or $\hat{k}$ from our Table 1) are more accurate. For some traits they considered, the estimated number of marker loci, without using the Bonferroni correction for experimentwise Type I error, was two to three times larger than the estimates based on the Bonferroni correction for experimentwise error.
The strategy of using field data and $\hat{k}_{adj}$ is a cost-effective method of estimating the number of genes responsible for the difference between a substitution line and a parent. When assumptions are met, the method gives reasonable, unbiased estimates of the number of genes. The method can also be applied to data from other breeding plans such as the inbred-backcross approach. However, it is important to recognize that the method is based on a number of critical assumptions. The observed genetic variation is implicitly assumed to be caused by only a few genes even though genetic models based on a large number of loci could explain the data equally well. Failure of this assumption will cause $\hat{k}_{adj}$ to be an underestimate (Mulitze and Baker, 1985). In addition, even if all assumptions hold, classification errors may cause the final $\hat{k}_{adj}$ estimate to depend on the initial k(0) values, resulting in situations where both large and small k explain the data equally well. Results in this study are based on the assumptions that loci are either unlinked or in coupling phase. It is not clear how the presence of both coupling and repulsion phase linkage would affect the estimates. Finally, the method requires the standard assumptions that the traits are normally distributed and the appropriate alpha level is 0.05. It is not clear how violation of either of these assumptions affects the estimates.
| ACKNOWLEDGMENTS |
|---|
| NOTES |
|---|
Received for publication April 23, 1999.
| Appendix |
|---|
Assume k unlinked loci, each with two alleles ($A_i$, $a_i$; i = 1, ..., k) and each $A_i$ allele having the same effect ($\delta$). Using the binomial distribution, the probability that a line has j loci with positive alleles is $\binom{k}{j}/2^k$ and this line will have mean value $\mu_p + j\delta$. Thus, the mean values for the lines will range from $\mu_p$ (parent mean) to $\mu_p + k\delta$ (= $\mu_s$, the substitution line mean). To find the mean difference between the nonparental lines and the parent ($\mu_{np} - \mu_p$), it is necessary to obtain the expected value of the nonparental lines ($\mu_{np}$). Recall that the nonparental lines are the lines that have at least one locus with a substitution line allele ($A_i$). The probability distribution of the nonparental lines may be obtained by finding a constant, c, such that $c\sum_{j=1}^{k}\binom{k}{j}/2^k = 1$. Using properties of summations of combinations, $c = 2^k/(2^k - 1)$, and the probability that a nonparental line has j loci (j = 1, ..., k) with substitution line alleles ($A_i$) is $[2^k/(2^k - 1)]\binom{k}{j}/2^k$. Using this probability distribution for the nonparental lines with their means ($\mu_p + j\delta$), the expected value of the nonparental lines is $\mu_p + k[2^{k-1}/(2^k - 1)]\delta$. Since there are k loci that differ between the substitution and the parent lines, $\mu_s - \mu_p = k\delta$. Substitution of $\delta = (\mu_s - \mu_p)/k$ gives $\mu_{np} - \mu_p = [2^{k-1}/(2^k - 1)](\mu_s - \mu_p)$. This difference is then used as the true difference in computing β.
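The closed form for the expected number of substitution line alleles in a nonparental line can be checked by direct enumeration for small integer k. The sketch below is only a numerical check of the algebra in this Appendix, and the function names are hypothetical.

```python
from math import comb


def nonparental_mean_loci(k):
    """E[j | j >= 1] for j ~ Binomial(k, 1/2), by enumerating j = 1, ..., k."""
    total = 2 ** k - 1  # number of equally likely nonparental genotypes
    return sum(j * comb(k, j) for j in range(1, k + 1)) / total


def closed_form(k):
    """k * 2**(k-1) / (2**k - 1), the expression derived in the Appendix."""
    return k * 2 ** (k - 1) / (2 ** k - 1)


for k in range(1, 9):
    print(k, nonparental_mean_loci(k), closed_form(k))
```

Multiplying either value by the per-locus effect, $(\mu_s - \mu_p)/k$, gives the mean difference $[2^{k-1}/(2^k - 1)](\mu_s - \mu_p)$ used in the text.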
SAS statements to estimate uncorrected ($\hat{k}$) and corrected ($\hat{k}_{adj}$) estimates of the number of loci with genes affecting the difference between the parental and substitution lines for several traits
* r = lines differing from the parent, m_r = lines not differing, beta = Type II error rate;
data a; input tn trait$ r m_r beta ; alpha=.05;
m = r+m_r; d=1-beta-alpha; p=r/m; output;
* add adjusted rows that keep p away from 0 and 1 (values assume m = 50);
if p< alpha then do; r=.5; p=r/m; trait=trim(trait)||'adj';output; end;
if p> 1-beta then do; r=49.5; p=r/m; trait=trim(trait)||'adj'; output; end; cards;
1 yld 5 45 .000
2 snt 0 50 .000
3 tk 19 31 .049
4 till 2 48 .000
5 tst 0 50 .000
6 ht 18 32 .011
7 hd 34 16 .014
;
proc print;
*************** standard estimators *****************;
* link is Eq. 1 and the inverse link is its solution for the proportion;
proc genmod order=data; class trait;
model r/m = trait / dist=bin noint;
fwdlink link = log(1-_mean_)/log(.5);
invlink ilink= 1 - exp(log(.5)*_xbeta_);
*************** corrected estimators ****************;
* link is Eq. 3 substituted into Eq. 1 with d = 1 - beta - alpha;
data b; set a;
if p<alpha or p>1-beta then delete;
proc print;
proc genmod order=data; class trait;
model r/m = trait / dist=bin noint;
fwdlink link = log(1-(_mean_- alpha)/d)/log(.5);
invlink ilink= d*(1 - exp(log(.5)*_xbeta_)) + alpha;
run;
| REFERENCES |
|---|