A.M. Casa, S.E. Mitchell, M.T. Hamblin, and S. Kresovich, Institute for Genomic Diversity, Cornell Univ., Ithaca, NY 14853; J.D. Jensen and C.F. Aquadro, Dep. of Molecular Biology and Genetics, Cornell Univ., Ithaca, NY 14853; A.H. Paterson, Plant Genome Mapping Laboratory and Comparative Grass Genomics Center, Univ. of Georgia, Athens, GA 30602. DNA sequences were deposited in GenBank under accession numbers DQ459071, DQ462793–DQ463100.
* Corresponding author (amc56{at}cornell.edu).
| ABSTRACT |
|---|
|
|
|---|
| INTRODUCTION |
|---|
|
|
|---|
Population genetics theory predicts that intense directional selection, as would be experienced during crop domestication, dramatically reduces variation at the genomic target of selection and at linked neutral loci, a phenomenon known as genetic hitchhiking (Maynard Smith and Haigh, 1974). Following a selective sweep, new mutations arising in the selected region initially result in a skew in the site frequency spectrum (i.e., an excess of rare alleles). Selection might also lead to genetic differentiation as a consequence of allele frequency shifts between selected and nonselected populations (e.g., a domestication-associated allele will quickly increase in frequency in the cultivated populations). In a genome-wide scan of diversity at neutral markers such as SSRs, loci that show unusual patterns of allelic variation relative to genome-wide averages (i.e., locus-specific reductions in diversity, excess of rare alleles, or increased population differentiation) may be linked to targets of selection. Genome-wide scans of diversity have been used in this manner to identify candidate genomic regions in organisms such as humans, Drosophila, Arabidopsis, and maize (Zea mays L.) (Vigouroux et al., 2002; Kauer et al., 2003; Kayser et al., 2003; Aranzana et al., 2005). Once a candidate region has been identified, it may be possible to identify the target by surveying adjacent genomic regions and looking for a return to neutral patterns of variation (Schlotterer, 2003).
Sorghum bicolor, a tropical grass probably domesticated in eastern Africa 3000 to 6000 years ago (Kimber, 2000), is the fifth most important grain crop worldwide (FAO, 2004). Because of its ability to tolerate drought, soil toxicities, and temperature extremes more effectively than other cereals including maize, grain sorghum is a pillar of food security in the semiarid zones of western and central Africa. Sorghum's global socioeconomic importance has prompted substantial interest in characterizing levels of genetic diversity using molecular markers (Dje et al., 2000; Grenier et al., 2000; Menz et al., 2004). More recently, studies of DNA sequence diversity (Hamblin et al., 2004, 2005, 2006) have indicated that sorghum has both lower nucleotide diversity and more extensive linkage disequilibrium (LD) than maize. Compared with more distantly related rice (Garris et al., 2003), however, sorghum has less extensive LD.
Recently, an SSR-based genome-wide scan of diversity in S. bicolor identified several loci with patterns of variation deviating from neutral expectations (Casa et al., 2005), but data were not sufficient to determine whether the apparent signal of selection resulted from a true selective event or from demographic factors such as population bottlenecks or mating system. For example, bottlenecks can produce locus-specific effects that resemble the effects of directional selection (Thornton and Andolfatto, 2006), and the degree of genetic differentiation between populations is usually higher in self-pollinating species than in outcrossers, independent of selection (Hamrick and Godt, 1996).
Here, we sequenced a bacterial artificial chromosome (BAC) clone containing the previously identified candidate SSR locus, Xcup15, which exhibited the highest genetic differentiation between wild and cultivated S. bicolor (Fst = 0.76) (Casa et al., 2005). We also collected and analyzed DNA sequence data from a panel of 17 cultivated and 13 wild sorghum accessions to determine if patterns of variation in this region of the S. bicolor genome show evidence of a domestication-related selective sweep.
| Materials and Methods |
|---|
|
|
|---|
DNA was extracted from single BAC clones and used as templates in PCRs to confirm the presence of locus Xcup15. From these clones, a single BAC, c0156b06, was selected for complete DNA sequencing because of its central position on the S. bicolor physical map (www.genome.arizona.edu, verified 5 May 2006) relative to the location of Xcup15 and to the other BACs evaluated.
Randomly sheared libraries with inserts ranging from 1.0 to 4.0 kb were constructed and shotgun sequencing was performed with pGEM-T (Promega Corporation, Madison, WI) vector primers by MWG-Biotech (Ebersberg, Germany). Sequence reads were generated to ~8-fold coverage and assembled using Sequencher (Gene Codes Corporation, Ann Arbor, MI) followed by visual inspection of chromatograms. The DNA sequence of BAC clone c0156b06 was deposited in the National Center for Biotechnology Information (NCBI) nucleotide database (GenBank) under accession number DQ459071.
BAC Annotation and Sequence Comparisons
Genes were predicted using the Rice Genome Automated Annotation System (http://ricegaas.dna.affrc.go.jp/, verified 5 May 2006). This system utilizes several gene prediction programs including FGENESH (trained with monocot sequences), GENSCAN (trained with Arabidopsis or maize sequences), and the Rice Hidden Markov Model (RiceHMM). We annotated genes only where all prediction programs were in agreement. We also queried against the rice genome sequence (www.gramene.org, verified 5 May 2006) and the NCBI protein (nr), nucleotide (nt), and expressed sequence tag (EST) databases. Predicted gene sequences were considered to be expressed if they were at least 99.8% similar to S. bicolor ESTs or EST consensus sequences. This criterion for sequence similarity was determined by pairwise comparisons of nucleotide diversity (π) observed in coding regions of cultivated sorghum (average π = 0.0020 or about one single nucleotide polymorphism, SNP, every 500 bp) (Hamblin et al., 2006). Repetitive sequences were identified by searching against both the Poaceae RepBase (www.girinst.org, verified 5 May 2006) and the TIGR Gramineae Repeat Database (http://tigrblast.tigr.org/euk-blast/index.cgi?project=osa1, verified 5 May 2006). PipMaker (Schwartz et al., 2000) was used both to align DNA sequences from rice BAC clone OSJNBa0003A09 (GenBank accession AC118132), identified by similarity searches above, and S. bicolor BAC c0156b06 and to generate sequence identity and dot plots.
Diversity Analysis
Plant Material
DNA sequences around Xcup15 were collected from 30 S. bicolor accessions including both cultivated (subsp. bicolor) (n = 17) and wild (subsp. arundinaceum) (n = 13) lines and a weedy relative, S. propinquum (Table 1). These accessions, comprising all S. bicolor subspecies and races, were chosen to maximize geographic distribution, morphological variation, and genetic diversity as assessed by variation at 74 SSR loci (Casa et al., 2005). This sampling strategy was devised to minimize the effects of population structure on tests of selection. Seeds from cultivated material (landraces) were obtained either from the National Center for Genetic Resources Preservation (USDA-ARS, Ft. Collins, CO) or the Plant Genetic Resources Conservation Unit (USDA-ARS, Griffin, GA), and seeds from wild accessions were provided by Mitchell R. Tuinstra (Agronomy Department, Kansas State University). Sorghum propinquum leaves were obtained from the Plant Genome Mapping Laboratory (University of Georgia). Information on geographic origin and racial classification was gathered primarily from the System-wide Information Network for Genetic Resources database (http://singer.cgiar.org/Search/SINGER/search.htm, verified 5 May 2006).
DNA Sequence Analysis
Summary statistics, including levels of diversity based on both the average number of nucleotide differences per site between two sequences (π) and the number of segregating sites (θ), interspecific divergence, and Fst, were calculated using DnaSP v. 4.0 (Rozas et al., 2003). Insertion–deletion polymorphisms were excluded from these analyses. Three statistics were employed to evaluate deviations from the neutral, equilibrium model:
(i) The HKA test (Hudson et al., 1987) was used to compare ratios of polymorphism to divergence for sampled regions assuming a neutral model (i.e., no selection). Each locus was tested against a reference locus composed of pooled data from 204 loci (Hamblin et al., 2006). For intraspecific polymorphism the following parameters were used: S = 1075, N = 16, and L = 138243, where S is the number of variable sites, N is the sample size, and L is the total number of nucleotide sites surveyed in a sample of cultivated sorghum. For interspecific divergence we used K = 1948 and L = 136626, where K is the average number of differences between cultivated S. bicolor and S. propinquum and L is the number of nucleotide sites evaluated. A Bonferroni correction was applied to account for multiple comparisons.
(ii) Tajima's D (Tajima, 1989) was employed to test for an excess of rare alleles. Following a selective sweep, new mutations arise in the selected region resulting in a skew in the distribution of nucleotide polymorphisms (site frequency spectrum). The population bottleneck associated with sorghum's domestication is, however, expected to affect the site frequency spectrum genome-wide; in particular, the variance of D will be much larger than under a neutral equilibrium model. Critical values of D were obtained from coalescent simulations of a simple bottleneck model that produces the same average number of segregating sites and the same average D as was observed in a genome-wide survey of variation in cultivated sorghum, and in which most of the parameters were estimated based on independent data (Hamblin et al., 2006): the average ancestral population mutation parameter (4Neµ) was fixed at 3.8 based on variation in wild S. bicolor; the population recombination parameter (4Ner) was fixed at 0.01 bp (Hamblin et al., 2005). The time of the bottleneck was 0.025(4Ne) generations ago, which would correspond to about 14000 generations ago if all our assumptions were correct (although this is considerably longer ago than is suggested by archeological data, namely 3000 to 6000 years ago, more recent bottlenecks were incompatible with the observed average value of D). Assuming that the size of the current population and the ancestral population are the same, the intensity of the bottleneck (the size of the bottlenecked population relative to its duration) required to produce the observed value of S was 2.1, equivalent to a 128-fold reduction in population size. The distribution of D values generated by 10000 simulations of this model had a 95% confidence interval of −1.96 to +2.35.
(iii) The CLR test (Kim and Stephan, 2002) was employed for detecting directional selection along a recombining chromosome. This test compares the likelihood of the observed patterns of DNA sequence variation under a selective sweep model with that under a neutral equilibrium model of evolution. The CLR test was also used to generate maximum likelihood estimates (MLEs) of the location of the putative selected site (X) and the strength of selection (α = 2Nes). The following parameters were used: NCD = 0 (number of coding regions), Rn = 0.023 [scaled recombination rate (4Ner) per nucleotide, where r = 4 × 10⁻⁸ (Hamblin et al., 2005) and 4Ne = 570000 (Hamblin et al., 2006)], θ 1 (Watterson's estimate of θ from data; Watterson, 1978), Nrepl 1 (number of replicates), LBs 1 and RBs 100250 (left and right boundaries on the candidate region where beneficial mutation might be located), and intX 1000 (interval between initial guesses of X). Recombination rate was assumed constant across the region and θ was estimated from the data in order to make the CLR test conservative (Kim and Stephan, 2002). The frequency of the beneficial allele was set to 1. This method assumes the selected site was fixed very recently. Only accessions for which DNA sequences were available for all loci were included in the analysis (see Table 1). Variable sites were coded as either ancestral (0) (if the nucleotide at the variable position was shared with S. propinquum) or derived (1).
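The scaled quantities quoted in items (ii) and (iii) can be cross-checked with a few lines of arithmetic. The sketch below is only a consistency check on the numbers given in the text, assuming r = 4 × 10⁻⁸ per nucleotide; the function and variable names are illustrative and are not part of any of the programs used.

```python
# Consistency check on scaled parameters quoted in the text.
# All names here are illustrative; only the numeric values come from the text.

FOUR_NE = 570_000        # 4Ne (Hamblin et al., 2006)
R_PER_SITE = 4e-8        # recombination rate per nucleotide (assumed 4 x 10^-8)
BOTTLENECK_TIME = 0.025  # time of the bottleneck, in units of 4Ne generations


def scaled_recombination(four_ne: float, r: float) -> float:
    """Scaled recombination rate per nucleotide, Rn = 4Ne * r."""
    return four_ne * r


def generations_since_bottleneck(t_scaled: float, four_ne: float) -> float:
    """Convert a time measured in units of 4Ne generations into generations."""
    return t_scaled * four_ne


rn = scaled_recombination(FOUR_NE, R_PER_SITE)
gens = generations_since_bottleneck(BOTTLENECK_TIME, FOUR_NE)

print(f"Rn = {rn:.4f} per nucleotide")            # rounds to the 0.023 used in the CLR test
print(f"bottleneck ~ {gens:.0f} generations ago")  # about 14000 generations
```

Both values agree with those reported above (Rn = 0.023 and about 14000 generations) to within rounding.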
Distinguishing between Positive Selection and Demographic Factors
A goodness-of-fit (GOF) test (Jensen et al., 2005) was performed for discriminating whether CLR test rejections were due to selection or to nonequilibrium demographic effects. To determine significance, GOF values obtained from our polymorphism data were compared with those estimated from 1000 data sets simulated under a selection scenario using the maximum likelihood parameter estimates of the location and intensity of selection from the CLR test. In this way, given that the dataset has rejected neutrality in favor of selection, the GOF sets the CLR test selection model as the null and determines whether the sweep model explains the data well, or whether the data simply poorly fit a neutral, equilibrium model.
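The comparison of an observed GOF value with values from simulated data sets amounts to an empirical P value. The following is a minimal sketch of that step only, assuming that larger GOF values indicate a poorer fit; the function name and the toy numbers are hypothetical, and the simulation of data sets under the sweep model (Jensen et al., 2005) is not reproduced here.

```python
from typing import Sequence


def empirical_p_value(observed: float, simulated: Sequence[float]) -> float:
    """Proportion of simulated statistics at least as large as the observed one.

    Assumes larger values mean a poorer fit to the null (sweep) model.
    This is the plain proportion; some authors add 1 to the numerator
    and denominator to avoid reporting P = 0.
    """
    if not simulated:
        raise ValueError("need at least one simulated value")
    exceed = sum(1 for value in simulated if value >= observed)
    return exceed / len(simulated)


# Toy illustration with made-up numbers (not data from this study).
toy_simulated = [1.0, 2.0, 3.0, 4.0]
toy_p = empirical_p_value(3.0, toy_simulated)
print(toy_p)
```

In the analysis described above, the list of simulated values would hold the 1000 GOF values obtained under the selection scenario.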
| Results and Discussion |
|---|
|
|
|---|
200000 S. bicolor ESTs in the public domain (www.ncbi.nlm.nih.gov/; verified 5 May 2006), these do not capture all expressed genes.
7.7 kb of DNA sequence obtained from each individual was from noncoding DNA). In only one instance (locus 7) was coding sequence solely analyzed. The candidate SSR, Xcup15, resides within locus 9b (Tables 3 and S-1).
Levels of within and between species variation (diversity and divergence, respectively) in the sampled region are shown in Table 3. Cultivated sorghums were invariant at six of the 10 loci and average nucleotide diversity (π) was 0.0008 (range 0.0–0.0071), a considerably lower estimate than that obtained in a previous study of other genomic regions in the same sorghum accessions (average π was 0.0023) (Hamblin et al., 2006). In general, levels of diversity based on the number of segregating sites (θ) were lower than those based on π (Table 3). Notably, locus 9-10b was unusually diverse. This locus, from an intergenic region rich in miniature inverted repeat transposable elements (MITEs) between the PP2C gene and a predicted protein, accounted for most (~90%) of the variation detected within cultivated sorghum.
π (0.0027) and θ (0.0031) were similar and about three times higher than in cultivated sorghum. Accession L-WA15 was heterozygous at three loci and three wild samples had a MITE insertion within locus 9-10a (data not shown). As observed in cultivated lines, locus 9-10b exhibited the highest levels of variation (Table 3). Notably, a ~1 kb transposon-like insertion was observed within locus 9b (which includes SSR locus Xcup15) of S. propinquum (outgroup). This insertion was absent in all cultivated and wild accessions.
Diversity and divergence trends for cultivated and wild sorghums within the genomic region containing Xcup15 are shown in Fig. 2. Directional selection is expected to reduce levels of diversity in cultivated relative to wild sorghum around the selection target. Previous values based on genome-wide estimates of nucleotide diversity have indicated that cultivated accessions exhibit about two-thirds the diversity observed in wild material (Hamblin et al., 2005). In the Xcup15 region, however, cultivated lines were even less diverse, showing one-third the diversity of wild accessions. The contrast is even more striking when polymorphism data for locus 9-10b, an extreme outlier with similar polymorphism levels in both cultivated and wild sorghums (see above), are excluded from the analysis. In that case, the amount of variation in cultivated sorghum was only 5% of that observed in wild accessions. The magnitude of this reduction in diversity is comparable with that reported for domestication-related genes in maize. In contrast to genome-wide estimates that indicate that maize contains ~57% of the variability found in its progenitor (Wright et al., 2005), the promoter regions of the maize teosinte branched1 (tb1) (Doebley et al., 1995) and teosinte glume architecture1 (tga1) (Dorweiler et al., 1993) alleles possess 3% (Wang et al., 1999) and 5% (Wang et al., 2005), respectively, of the variation observed in wild relatives, the teosintes. Both tb1 and tga1 have been shown to be targets of domestication-related selection in maize (Wang et al., 1999, 2005).
Fst measures the level of genetic differentiation between populations (here, cultivated and wild sorghums) based on allele frequencies. Under a scenario of directional selection in cultivated sorghum, Fst values are expected to be higher at the selection target and adjacent loci, but diminish with distance as recombination prevents the unusual differentiation associated with selection from occurring. Although the average value of Fst observed across the entire Xcup15 region (0.15) is comparable with a previous estimate based on genome-wide SSR data (Fst = 0.13) (Casa et al., 2005), loci corresponding to the third intron (locus 9a) and the 5' UTR (locus 9b) of the PP2C gene revealed a considerably greater degree of differentiation (0.52 and 0.46, respectively) (Fig. 2 and Table 4). Thus, the Fst analysis suggests that selection may have occurred in or near loci 9a and 9b (the PP2C gene).
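As an illustration of how differentiation between two groups of sequences can be quantified, the sketch below uses one common sequence-based estimator, Fst = 1 − Hw/Hb, where Hw is the mean number of pairwise differences within groups and Hb the mean number between groups (Hudson et al., 1992). Whether this matches the exact estimator applied by DnaSP in this study is an assumption, and the toy sequences are invented for the example.

```python
from itertools import combinations, product
from typing import Sequence


def differences(a: str, b: str) -> int:
    """Number of sites at which two aligned sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def mean_within(pop: Sequence[str]) -> float:
    """Mean pairwise differences among sequences of one group."""
    pairs = list(combinations(pop, 2))
    return sum(differences(a, b) for a, b in pairs) / len(pairs)


def fst_sequence_based(pop1: Sequence[str], pop2: Sequence[str]) -> float:
    """Fst = 1 - Hw/Hb for two groups of aligned sequences."""
    hw = (mean_within(pop1) + mean_within(pop2)) / 2.0
    between = [differences(a, b) for a, b in product(pop1, pop2)]
    hb = sum(between) / len(between)
    if hb == 0:
        return 0.0  # identical groups: no differentiation to measure
    return 1.0 - hw / hb


# Invented toy alignment: an invariant "cultivated" group and a variable "wild" group.
toy_cultivated = ["AAT", "AAT", "AAT"]
toy_wild = ["GAT", "GCT", "GAC"]
toy_fst = fst_sequence_based(toy_cultivated, toy_wild)
print(round(toy_fst, 2))
```

In the toy data the first site is a fixed difference between groups, which drives Fst upward even though the wild group is variable at the other sites.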
Second, simulation studies have shown that LD increases after a selective sweep (Przeworski, 2002; Kim and Nielsen, 2004). A particular haplotype (extending for at least 99 kb) predominated among cultivated sorghums while wild accessions showed no such haplotype structure (Fig. 3). Previous estimates in sorghum have indicated that LD decays, on average, by 15 kb (Hamblin et al., 2005). Although low levels of polymorphism in the Xcup15 region precluded our ability to assess LD levels, the haplotype structure in cultivated sorghums was unusual and resembled that observed in swept regions of other species. For example, DNA sequence data from maize, a randomly mating outcrossing species (Brown and Allard, 1970), have suggested that selection produces higher LD. In a survey of six genes (1.2–10 kb in length) in a diverse set of tropical and semitropical lines of maize, Remington et al. (2001) found that LD declined rapidly (within 200–1500 bp) for five genes but that it decayed much more slowly (within ~10 kb) for sugary 1 (su1). Subsequent analysis showed that su1, which encodes an enzyme in the starch biosynthesis pathway, had been under directional selection during either domestication or breeding (Whitt et al., 2002). Extended LD has also been detected around the maize allele of the Y1 gene that encodes yellow endosperm (Palaisa et al., 2003). In rice, nucleotide diversity data surrounding the xa5 locus, a bacterial blight resistance gene, showed significant LD between sites 100 kb apart for resistant accessions but no significant association among susceptible types (Garris et al., 2003). Rice, like sorghum, is a predominantly selfing species although outcrossing rates in rice (<1%) (Rong et al., 2004) are much lower than estimates for sorghum (5–30%) (Ollitraut, 1987; Doggett, 1970).
A fixed transition between wild (including S. propinquum) and cultivated accessions was observed at position 56122 bp of the BAC clone (corresponding to the 5' UTR of the PP2C gene), ~105 bp upstream of SSR Xcup15 (Fig. 3). Previous analysis of variation across a total of 23174 bp (Hamblin et al., 2005) never yielded a fixed difference between DNA sequences from wild and cultivated sorghums. Moreover, DNA sequence alignment of this region to sequences from sugarcane, maize, and rice indicated that these taxa exhibit the same nucleotide (G) observed in the wild sorghums at position 56122 bp, confirming that the A allele in the cultivated accessions is derived. The serine–threonine phosphatase (PP2C gene) that harbors this fixed transition was most similar to Arabidopsis thaliana gene At3g51370 and belongs to one of the largest gene families described in plants. According to Kerk et al. (2002), Arabidopsis contains 69 such genes. Moreover, the PP2C Arabidopsis homolog is a member of one of the least studied groups of phosphatases, class D (Schweighofer et al., 2004). Serine–threonine phosphatases have been implicated in mechanisms such as abscisic acid (ABA) signal transduction, regulation of flower development (Schweighofer et al., 2004), and seed germination (Yoshida et al., 2006). Two sorghum domestication-related QTLs co-localize with Xcup15, one for plant height (Lin et al., 1995) and the other for primary branch number in the inflorescence (P.J. Brown, 2006, personal communication). Although the prospects are tantalizing, we have no evidence at present that the PP2C gene does or does not influence any of these phenotypes in S. bicolor. The high level of LD (haplotype structure) in the cultivated lines should also lead to caution in the acceptance of the PP2C gene as being the actual target of selection without additional functional and/or association studies (see below).
Statistical Evidence for Selection
We employed statistical methods to determine if the patterns of diversity observed in cultivated sorghums in the genomic region surrounding Xcup15 differed significantly from an equilibrium neutral model and in a manner consistent with a selection scenario in cultivated sorghum. Directional selection (i.e., fixation of a favorable mutation) will result in decreased variation at linked neutral regions and the size of the affected region is a function of both the regional rate of recombination and the strength of selection. To test if differences among loci in the amount of diversity within species relative to divergence between species were significant we employed the HKA test. Because the amount of DNA sequence variation observed within a species (diversity) is expected to be proportional to the amount of DNA sequence divergence between species at neutrally evolving loci (Kimura, 1983), significant differences in these ratios might suggest the local effects of selection. If a particular locus shows a low ratio of diversity to divergence relative to other loci, for example, directional selection may have been responsible for the reduced diversity and the locus possibly encodes or influences a domestication-related trait. Conversely, higher diversity than expected under a neutral evolution model might indicate the effects of balancing or diversifying selection (the locus could be involved in local adaptation or crop improvement). Results from HKA tests for the 10 loci surveyed are presented in Table 4. Among the comparisons performed for cultivated sorghum (each of 10 loci vs. a "reference locus" composed of genome-wide data) (see Materials and Methods), only locus 9b (the same locus that showed the fixed nucleotide difference between cultivated and wild sorghums) exhibited a significant P value (0.0009) after applying the Bonferroni correction. 
This finding indicates a deficiency of polymorphism in cultivated lines relative to divergence and is consistent with expectations under a model of recent directional selection. None of the HKA tests performed on loci from the wild accessions were significant (Table 4). Although not ideal, comparison of wild data to the cultivated reference locus was carried out due to the lack of an appropriate reference dataset derived exclusively from the wild sorghums, and is conservative for detection of directional selection.
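The quantity that the HKA test compares across loci, polymorphism relative to divergence, can be illustrated with the reference-locus values given in Materials and Methods. The sketch below is not the HKA test itself; it only converts the pooled counts into per-site Watterson's θ and per-site divergence. Treating the pooled data as a single sample of 16 sequences is a simplifying assumption, and the function names are illustrative.

```python
def harmonic_a1(n: int) -> float:
    """Watterson's correction factor: sum of 1/i for i = 1 .. n-1."""
    return sum(1.0 / i for i in range(1, n))


def watterson_theta_per_site(s: int, n: int, length: int) -> float:
    """Per-site Watterson's estimate from S segregating sites in n sequences."""
    return s / (harmonic_a1(n) * length)


def divergence_per_site(k: float, length: int) -> float:
    """Average number of between-species differences per site."""
    return k / length


# Reference locus (pooled data from 204 loci), values as quoted in the text.
theta_ref = watterson_theta_per_site(s=1075, n=16, length=138243)
div_ref = divergence_per_site(k=1948, length=136626)
ratio_ref = theta_ref / div_ref

print(f"theta per site      = {theta_ref:.5f}")
print(f"divergence per site = {div_ref:.5f}")
print(f"polymorphism / divergence = {ratio_ref:.3f}")
```

A locus whose own ratio falls well below the reference ratio is the kind of departure the HKA test evaluates formally, with significance assessed by the test rather than by this ratio alone.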
Another feature of the sequence data that can be used to infer the action of selection is the frequency distribution of polymorphisms. Assuming no recombination, a selective sweep of a new mutation or unique variant eliminates all linked neutral variation. With time, as the population recovers from the sweep, new mutations will accumulate, initially at low frequencies. This skew towards an excess of rare variants is measured by Tajima's D, which compares the difference between two measures of diversity, θw and π. The θw estimate (θ in Table 3) is based on the number of segregating sites and is, therefore, affected mostly by low frequency variants, while π (as reported in Table 3) is based on average nucleotide diversity and is mostly influenced by intermediate frequency alleles. Because the means of these two estimators are expected to be equal under neutrality (see Fay and Wu, 2005), significantly negative values of D are consistent with directional selection whereas significantly positive values are consistent with balancing selection. Results for Tajima's D (Table 4) indicated a predominance of low-frequency polymorphisms in both cultivated (average D = −0.45) and wild (average D = −0.47) sorghums. Although these results are in the direction expected under a directional selection scenario, none of the loci (D ranged from −1.50 to +1.35) differed significantly from expectations under either an equilibrium neutral model or a simple bottleneck model. Therefore, the Tajima's D results provide no evidence for a recent selective sweep of a single new or unique variant. This result is not surprising considering that the power of this test for detecting a selective sweep is restricted to a fairly narrow time interval following the sweep (Simonsen et al., 1995).
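For readers who wish to see how D contrasts the two diversity measures, the following is a minimal sketch of the statistic as defined by Tajima (1989), written independently of DnaSP. The haplotype strings are invented toy data coded 0 (ancestral) and 1 (derived), and the helper names are illustrative.

```python
from itertools import combinations
from math import sqrt
from typing import Optional, Sequence, Tuple


def summarize(haplotypes: Sequence[str]) -> Tuple[int, int, float]:
    """Return (n, S, k): sample size, segregating sites, mean pairwise differences."""
    n = len(haplotypes)
    length = len(haplotypes[0])
    # A site is segregating if more than one state is present in the column.
    s = sum(1 for i in range(length) if len({h[i] for h in haplotypes}) > 1)
    pairs = list(combinations(haplotypes, 2))
    k = sum(sum(a != b for a, b in zip(x, y)) for x, y in pairs) / len(pairs)
    return n, s, k


def tajimas_d(n: int, s: int, k: float) -> Optional[float]:
    """Tajima's D from sample size n, segregating sites s, and mean pairwise differences k."""
    if s == 0:
        return None  # D is undefined for an invariant locus
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    # Numerator: pairwise-difference estimate minus the segregating-sites estimate.
    return (k - s / a1) / sqrt(e1 * s + e2 * s * (s - 1))


# Toy sample in which every polymorphism is a singleton (excess of rare variants).
toy = ["100", "010", "001", "000"]
toy_n, toy_s, toy_k = summarize(toy)
toy_d = tajimas_d(toy_n, toy_s, toy_k)
print(toy_n, toy_s, toy_k, toy_d)
```

An excess of singletons yields a negative D, as in the toy sample. As described in Materials and Methods, significance in this study was judged against values simulated under a bottleneck model rather than against the standard neutral expectation.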
Unlike the previous tests, in which loci are tested individually, the likelihood-based statistical test or CLR evaluates the significance of a local reduction of variation along a physically linked but not necessarily contiguous stretch of DNA (see Materials and Methods). Departure from neutrality is, therefore, tested with sequence data from all loci simultaneously. Moreover, the CLR estimates the strength and location of directional selection from DNA sequence data. We tested polymorphism data for the cultivated and wild groups separately and also for the combined dataset to evaluate species-wide patterns. Results from this composite likelihood analysis rejected the neutral equilibrium model in favor of a strong selective sweep or hitchhiking model (MLE of the strength of selection, α = 10087) only in the combined data set. When population size (Ne) is set to 142500 (see Materials and Methods), the MLE of α suggests a selection coefficient (s) of 0.035. This value of s is similar to those obtained for the tga1 (s = 0.03–0.04) (Wang et al., 2005) and tb1 (s = 0.04–0.08) (Wang et al., 1999) genes of maize. As indicated above, both loci have been shown to be targets of domestication-related selection in maize (Wang et al., 1999, 2005). In addition, the CLR test located the target of selection at position 26107 bp of the BAC clone sequence (between genes 2 and 3) (Table 2) and ~30 kb upstream of the fixed transition (at 56122 bp) observed between wild and cultivated sorghums (see above). Except for multiple transposable element-related coding sequences, the region containing the predicted target comprises the longest expanse of DNA containing no predicted genes (Table 2). It is worth noting, however, that simulation studies have recently demonstrated that the MLE of the target of selection is less reliable in partially sequenced regions, having a very large mean square error relative to estimates based on complete sequence (J.D. Jensen, 2006, personal communication). In order to quantify this result, 95% confidence intervals were calculated via parametric bootstrap and were seen to encompass ~39% of the total region, between positions 6487 and 45722. To improve precision of our localization, therefore, we would need to collect contiguous DNA sequence polymorphism data from across the entire 99 kb sample region (a very significant sequencing effort).
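The figures in the preceding paragraph follow from simple arithmetic on the CLR output. The sketch below reproduces them under the definition α = 2Nes given in Materials and Methods; taking the right boundary of the candidate region (100250 bp) as the length of the total region is an assumption made for this illustration, and all names are illustrative.

```python
# Arithmetic behind the CLR-derived figures quoted in the text.
ALPHA_MLE = 10087          # MLE of the strength of selection (alpha = 2 * Ne * s)
NE = 142_500               # effective population size used in the text
TARGET_POS = 26107         # MLE of the location of the selected site (bp)
FIXED_TRANSITION = 56122   # position of the fixed transition (bp)
CI_LOW, CI_HIGH = 6487, 45722
REGION_LENGTH = 100250     # assumed: right boundary (RBs) of the candidate region


def selection_coefficient(alpha: float, ne: float) -> float:
    """Solve alpha = 2 * Ne * s for s."""
    return alpha / (2.0 * ne)


s_hat = selection_coefficient(ALPHA_MLE, NE)
distance_bp = FIXED_TRANSITION - TARGET_POS
ci_fraction = (CI_HIGH - CI_LOW) / REGION_LENGTH

print(f"s = {s_hat:.3f}")
print(f"distance from fixed transition = {distance_bp} bp")
print(f"CI spans {ci_fraction:.0%} of the region")
```

The three results round to the values reported above: s of 0.035, a distance of about 30 kb, and a confidence interval covering about 39% of the region.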
Distinguishing Selection from Demographic Factors
Results from the CLR test indicated that patterns of diversity in this region of the sorghum genome are a better fit to a selective sweep model than to an equilibrium neutral model. This test, however, is not robust to undetected population structure or a recent bottleneck (Jensen et al., 2005), processes that can generate large deviations from equilibrium and patterns of sequence variation that resemble those expected under a selection scenario. For example, an alternative interpretation of the diversity patterns observed for cultivated and wild sorghums (Fig. 2) could involve demographic amplification of ancestral stochastic variation via a population bottleneck associated with cultivation. Alternatively, this pattern could represent a preexisting sweep signal (i.e., selection occurred in the wild sorghums and was amplified in cultivated lines through one or more bottlenecks) (see Pool et al., 2006).
To address this issue, we took the maximum likelihood estimates from the CLR test and employed them in the GOF test, which has been shown to have high sensitivity for discriminating between a hitchhiking model and nonequilibrium demography (Jensen et al., 2005). Results from the GOF test suggest that the hitchhiking model fits the data poorly (P = 0.12; the lower the value the worse the fit) and, therefore, the signal detected by the CLR method cannot be distinguished from demography. We should note, however, that other factors might account for the poor fit observed with the GOF test. First, Jensen et al. (2005) have indicated that deviations from a simple selection model (one that assumes a single, recent, and complete sweep) can generate a large ΛGOF (and therefore a small P value), even if selection has taken place. Additionally, joint analysis of the wild and cultivated data artificially created population structure (see Fst results, Table 4), which has been shown to lead to false positives with the CLR test (Jensen et al., 2005). Furthermore, the sweep model (Kim and Stephan, 2002) assumes that the data are sampled from a random mating population at equilibrium. Sorghum, however, is a predominantly selfing species and it is not a population in equilibrium (Hamblin et al., 2005, 2006). While the GOF test appears to be robust to violations of a number of these assumptions in Drosophila (Jensen et al., 2005), the effects of these violations are as yet unexplored in a species such as sorghum. Thus, the results of the CLR test should be viewed only as being consistent with, and not evidence for, recent strong selection in this region of the sorghum genome.
Implications for Identifying Targets of Directional Selection
The power to detect directional selection is directly proportional to the amount of within-species diversity. That is, higher levels of variation provide more power for detecting significant reductions in variation likely associated with selection (Wright et al., 2005; Yamasaki et al., 2005; Hamblin et al., 2006). Cultivated sorghum exhibits one-fourth of the amount of genetic variation observed in a comparable sample of geographically and genetically diverse maize landraces (Hamblin et al., 2004, 2005). Therefore, the low levels of diversity observed within sorghum, coupled with the relatively low divergence to the outgroup (S. propinquum), represent major factors limiting our ability to unambiguously determine the target of selection in this genomic region.
When employing genome-wide scans of diversity to identify signals of selection, there are both advantages and disadvantages associated with having extensive haplotype structure or LD. For example, species with fairly extensive LD such as rice and sorghum require lower marker density for suitable genome coverage compared with species in which LD decays much more rapidly (e.g., maize). Conversely, extensive haplotype structure also hinders exact localization of the selection target. Because one major haplotype was observed along the 99 kb Xcup15 region of cultivated sorghum, at least 12 predicted genes (1, 3, 4, 5, 6, 7, 8, 9, 10, 13, 14, and 15; Table 2) should be considered as potential selection candidates. Given that we were unable to establish the precise boundaries of this putative sweep, genes outside this range are possible candidates as well, despite the evidence of a fixed derived mutation at the PP2C noted in the previous section.
The number of genes that one needs to consider as selection candidates will also depend on the interplay between recombination and genome organization. A major difference between the genome organization of maize and sorghum (as well as rice and Arabidopsis) is the interspersion patterns of genes and repetitive sequences. Sorghum and rice have very compact genomes (772 and 470 Mb, respectively; Arumuganathan and Earle, 1991; Goff et al., 2002) and gene density tends to be high (Goff et al., 2002; Kim et al., 2005). Gene density in maize, on the other hand, is much lower (SanMiguel et al., 1996; Tikhonov et al., 1999), with genes separated by large blocks of highly methylated repetitive elements (Bennetzen et al., 1994) that are recombinationally suppressed. For example, although LD extends for up to 90 kb upstream of tb1 (Clark et al., 2004), a gene that has played a major role in the morphological transition from teosinte to maize (Doebley et al., 1995) and is the best documented target of strong directional selection in plants (Wang et al., 1999), tb1 is the only gene present within this range. The remaining 90 kb upstream region is composed almost entirely of transposable elements.
Implications for Association Studies and Future Directions
Genome-wide scans of diversity performed in highly diverse panels of maize have yielded dozens of candidates associated with domestication and/or crop improvement (Vigouroux et al., 2002; Wright et al., 2005; Yamasaki et al., 2005). The success of population genetics-based approaches in maize, therefore, prompted us to evaluate this methodology applied to a selfing species as a way of identifying targets of directional selection. As this study reveals, DNA sequence polymorphism data support our initial findings based on SSR genome-wide scans of diversity (Casa et al., 2005) that recent directional selection likely shaped diversity patterns around locus Xcup15. Thus, as has been shown in maize, population genetics-based approaches can also lead to the successful identification of candidate genomic regions in sorghum. However, the domestication process in sorghum may not have been as simple as it apparently has been in maize (see Matsuoka et al., 2002). While we assume a single, recent, and complete sweep, it is possible that the history of cultivated sorghum was complex and involved multiple domestication events and/or postdomestication gene flow between wild and cultivated sorghum.
This study has also revealed that unambiguous identification of the target of directional selection in sorghum might not be as straightforward as it presumably has been in maize, because of the overall low levels of variation, more extensive LD, and other departures from equilibrium in sorghum (Hamblin et al., 2006). This challenge might also be faced when such studies are conducted in species that exhibit genomic characteristics and mating systems similar to sorghum. As with the genomic signatures of directional selection, we do not really know what the signal of diversifying selection (pertaining to traits such as flowering time, plant height, and disease resistance) will look like in sorghum. From a practical point of view, however, use of directed (i.e., starting from traits of interest instead of random scans of diversity) and integrated approaches (i.e., combining population development, QTL mapping, and assessment of variation in diversity panels) should pave the way for the successful identification of functionally interesting alleles for crop improvement and line development in S. bicolor.
| ACKNOWLEDGMENTS |
|---|
| NOTES |
|---|
Received for publication February 1, 2006.
| REFERENCES |
|---|