Published online 1 November 2006
Published in Crop Sci 46:S-55-S-61 (2006)
© 2006 Crop Science Society of America
677 S. Segoe Rd., Madison, WI 53711 USA
ACTIVITIES & RESOURCES

Toward a Reference Sequence of the Soybean Genome: A Multiagency Effort

Scott A. Jackson*, Dan Rokhsar, Gary Stacey, Randy C. Shoemaker, Jeremy Schmutz and Jane Grimwood

S.A. Jackson, Dep. of Agronomy, Purdue Univ., 915 W. State St., West Lafayette, IN 47906; D. Rokhsar, JGI Production Genomics Facility, 2800 Mitchell Dr., Walnut Creek, CA 94598; G. Stacey, National Center for Soybean Biotechnology, Div. of Plant Sci. and Biochemistry, Dep. of Molecular Microbiology and Immunology, Univ. of Missouri, Columbia, MO 65211; R.C. Shoemaker, Corn Insect and Crop Genetics Research Unit, Ames, IA 50011; J. Schmutz and J. Grimwood, Stanford Human Genome Center, Dep. of Genetics, Stanford Univ. School of Medicine, 975 California Ave., Palo Alto, CA 94304

* Corresponding author (sjackson@purdue.edu).


    ABSTRACT
 
The face of soybean [Glycine max (L.) Merr.] genetics is set to change with the imminent genome sequence delivered by a triagency group [National Science Foundation (NSF), United States Department of Energy (USDOE), and United States Department of Agriculture (USDA)]. The approaches and alacrity with which scientists will be able to solve biological questions and advance breeding lines will be dramatically enhanced. Questions remain though. How and in what form will the genome be sequenced? How will the genome sequence be linked to genetic and physical maps and how will all this information be accessible for biologists and breeders? In this article, we show how the genome is being sequenced and how various groups and agencies are working together to ensure that the sequence is immediately available and of use to soybean researchers.


    INTRODUCTION
 
Most crop legumes belong to two major sister lineages: the Hologalegina and the Phaseoloids (Lavin et al., 2005). The Hologalegina lineage includes both model legumes, Lotus and Medicago, currently undergoing genome sequencing. The Phaseoloids, containing the economically important Glycine, Phaseolus, and Vigna, separated from the Hologalegina around 50 million years (MY) ago (Lavin et al., 2005) but shared a common ancestor and a common genome duplication event immediately before this separation (Pfeil et al., 2005). Given the 50 MY separating Phaseoloid legumes from the two models (Lotus and Medicago), the legume research community recommended that soybean be developed as the model genome for the Phaseoloid legumes (Gepts et al., 2005) because of its economic importance, moderate genome size, and existing infrastructure. An unexpected byproduct is that soybean is also an excellent model for examining the genomic consequences of genome duplication and/or polyploidy.

Soybean has a moderately sized and moderately complex genome of ≈1100 Mbp (Arumuganathan and Earle, 1991) packaged into 20 chromosome pairs. It is thought to have undergone two to three rounds of genome duplication and/or polyploidization during the last ≈45 MY (Shoemaker et al., 1996; Blanc and Wolfe, 2004; Schlueter et al., 2004). The most recent of these large-scale duplications may have occurred a mere 1 to 3 MY ago; thus, some duplicated blocks are highly similar at the sequence level (R.C. Shoemaker and S.A. Jackson, 2006, unpublished data). Although chromosome-level homeology does exist (Walling et al., 2006), multiple rounds of duplication followed by reshuffling have resulted in a mosaic genome: one that is highly duplicated, with regions that are either highly conserved or highly rearranged (J. Schlueter and R.C. Shoemaker, 2006, unpublished data).

Even though the genome has undergone several rounds of large-scale genome duplication (Arumuganathan and Earle, 1991; Shoemaker et al., 1996; Schlueter et al., 2004) and possesses a large percentage (40–60%) of repetitive sequences (Goldberg, 1978; Gurley et al., 1979), its major structure is still discretely defined. It appears to be organized such that low-copy sequences (euchromatin) occupy much of the chromosome arms, while high-copy sequences (heterochromatin) are sequestered in the centromeric and pericentromeric regions (Lin et al., 2005; Walling et al., 2006). In fact, we already know that approximately one-half of the ≈500 Mbp of repetitive DNA in soybean is contained in the centric, telomeric, or nucleolar organizing regions (N. Gill and S.A. Jackson, 2006, unpublished data).

To develop soybean as a genomic model for Phaseoloid legumes, a physical mapping effort, initially funded by the United Soybean Board (USB) and now funded by the NSF, was launched in 2005. This initiative was followed a short time later, in January 2006, by the joint announcement by the USDA and the USDOE of a whole-genome shotgun sequencing (WGSS) effort to be undertaken at the Joint Genome Institute (JGI). The timing was fortuitous: a physical map is necessary to assemble genome shotgun sequences onto chromosomes and linkage maps so that the outcome is a resource that is easily used by the research community. Here we report the status of the physical mapping and sequencing efforts and how these two efforts are being coordinated to produce a high-quality reference genome sequence for the model legume, soybean.


    Physical Framework
 
Although much of genome sequencing has moved in the direction of shotgun sequencing, a physical framework is often necessary to link the sequence to the underlying chromosomes and genetic maps. Moreover, linking genetic variation (genetic linkage maps) with the WGSS sequence, which in a practical sense is the ultimate goal because it enables gene cloning, requires a physical map that is integrated with the linkage map. In soybean, three BAC (bacterial artificial chromosome) libraries made with different restriction enzymes (HindIII, BstYI, and EcoRV; Marek and Shoemaker, 1997; Marek et al., 2001) are publicly available and included in the BAC-based physical map. BAC libraries often have representational biases (i.e., underrepresentation of certain genomic regions). The inclusion of libraries made with three different restriction enzymes should, to some extent, ameliorate that bias.

Portions of all three BAC libraries (≈5× physical coverage of each) are, or have been, fingerprinted using the snapshot methodology (Luo et al., 2003). This approach results in more information in the form of restriction fragments that can be used to assemble contigs of overlapping BACs using FPCs (fingerprinted contigs) (Soderlund et al., 1997; Nelson et al., 2005). Given the level of genome duplication in soybean, this approach to physically mapping the genome will be superior to other methods (Nelson et al., 2005). The assembled fingerprints provide an immediately useful tool for gene cloning and understanding of genome organization. The contigs, with associated genetic markers, are available at SoyBase (Grant and Shoemaker, 2006).
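The idea of assembling contigs from shared restriction fragments can be illustrated with a minimal sketch. This is not the FPC algorithm itself (FPC uses a probabilistic overlap score); it is a simplified stand-in that only counts fragments of matching size. The function names, the size tolerance, and the 50% cutoff are all assumptions chosen for illustration.

```python
def shared_bands(fp_a, fp_b, tolerance=2):
    """Count fragments in two fingerprints whose sizes agree within `tolerance`.

    Each fingerprint is a list of fragment sizes. Fragments are paired
    greedily in sorted order, and each fragment is used at most once.
    """
    a = sorted(fp_a)
    b = sorted(fp_b)
    i = j = shared = 0
    while i < len(a) and j < len(b):
        diff = a[i] - b[j]
        if abs(diff) <= tolerance:
            shared += 1
            i += 1
            j += 1
        elif diff < 0:
            i += 1  # a[i] is too small to match anything left in b
        else:
            j += 1  # b[j] is too small to match anything left in a
    return shared


def likely_overlap(fp_a, fp_b, min_fraction=0.5, tolerance=2):
    """Hypothetical rule: call an overlap if at least `min_fraction` of the
    smaller fingerprint's fragments are shared."""
    smaller = min(len(fp_a), len(fp_b))
    if smaller == 0:
        return False
    return shared_bands(fp_a, fp_b, tolerance) / smaller >= min_fraction


# Made-up fragment sizes for three BAC clones.
bac1 = [100, 250, 400, 610, 900]
bac2 = [251, 399, 612, 1200, 1500]
bac3 = [50, 700, 1000]
print(shared_bands(bac1, bac2), likely_overlap(bac1, bac2))
print(shared_bands(bac1, bac3), likely_overlap(bac1, bac3))
```

In a duplicated genome, homeologous regions can share fragments by descent rather than by physical overlap, which is one reason a fingerprinting method that yields more fragments per clone is attractive.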

To make the FPC map useful for helping to assemble WGSS data and, more immediately, for cloning genes, it has to be tied to the genetic linkage map. The soybean genetic linkage map is one of the most densely populated maps among plants, with more than 2007 published markers (Grant and Shoemaker, 2006). More than 1060 single-nucleotide polymorphisms (SNPs) have also been mapped but not yet published (P. Cregan, 2006, personal communication). Many of these markers, including simple sequence repeats (SSRs), restriction fragment length polymorphisms (RFLPs), and SNPs, include sequence information that is being used to associate BAC clones in FPCs with genetic locations. In addition to the genetic markers being placed on the physical map, expressed sequence tag (EST)-derived sequences are being mapped concurrently. As part of the USB- and NSF-funded projects, EST sequences are being placed onto BACs through overgo hybridizations. Currently, more than 600 of the published genetic markers have been associated with BACs. These correspond to 1881 BAC contigs (Soybean Physical Mapping Team, 2006, unpublished data) (Fig. 1). By the end of 2008, an additional 2000 genetic markers will be placed on the physical map. The distribution of overgos on the physical map is already providing an estimate of the location of gene-dense regions in the soybean genome (Jackson et al., 2006, unpublished data). Some BACs from these regions will be targeted for BAC-by-BAC sequencing as a subset of the JGI sequencing effort.


Fig. 1. cMap view of bacterial artificial chromosome (BAC) contigs integrated with the genetic map with QTL. From right to left, light bars are QTL, LG_A1 is a section of linkage group A1, the solid bar connected by thin lines to green loci is WmContig3588, green loci are genetic markers that anchor this contig, and green and black bars are BACs (green means one or more overgos or genetic markers are associated, black means none). Circles next to each BAC name indicate which markers, and of what kind, identify that BAC (No. of BACs hit referenced by OG1, only 2 hit OG2, or OG3).

 
To tie the sequence map to the genetically integrated physical map, sequence information from the end of each BAC clone in the physical map is being extracted in the form of BAC end sequences (BES). This will result in a sequence tag connector (STC) database consisting of BES comprising many megabase pairs of soybean sequence (Table 1). Two BAC libraries have been completed, and the remaining library will be done by the end of 2006. A contig with 50 BAC clones, for example, would have ≈100 STCs in the form of BES that can be linked to WGSS contigs or scaffolds to tie the physical and sequence maps together. An earlier NSF project provided for the development of a preliminary infrastructure of BAC contigs, anchored to the genetic map (Marek et al., 2001). In that study, 389 SSR and 223 RFLP markers were used, with some RFLP probes identifying as many as 150 BACs from multiple locations. This latter fact probably reflects the ancient polyploidy of the soybean genome.


Table 1. Summary of soybean BAC libraries and status of BAC end sequencing and fingerprinting.

 

    Genome Shotgun Sequence
 
The soybean genome is currently being sequenced using a WGSS approach similar to the one used to sequence poplar (Populus trichocarpa Torr. & A. Gray; http://genome.jgi-psf.org/Poptr1/Poptr1.home.html). A WGSS relies on manyfold sequencing redundancy to computationally reassemble the genome. In practice, this redundancy can be as much as five- to eightfold coverage (Venter et al., 2001). In addition to sequence redundancy, WGSS approaches also rely on a variety of library insert sizes to give paired reads that are a few kilobases apart, intermediately spaced (≈10 and 30–40 kb), and BAC-sized paired reads of >100 kb. The multiplicity of insert sizes allows reassembly algorithms to take advantage of the known distance between paired reads to assemble sequence contigs (contiguous sequence) into scaffolds (contigs joined by paired reads without intervening sequence information) (Fig. 2).
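To make the notion of fold coverage concrete, the following minimal sketch relates read count, read length, and genome size. The ≈1100 Mbp genome size and the five- to eightfold range come from the text; the 700-bp mean read length is an assumption, and the expected covered fraction uses the idealized Lander-Waterman model (reads placed uniformly at random), which real cloning and sequencing biases will violate.

```python
import math

GENOME_SIZE_BP = 1_100_000_000  # approx. 1100 Mbp, as stated in the text
MEAN_READ_LENGTH_BP = 700       # assumed read length, for illustration only


def fold_coverage(num_reads, mean_read_length_bp, genome_size_bp):
    """Sequencing redundancy: total bases sequenced divided by genome size."""
    return num_reads * mean_read_length_bp / genome_size_bp


def reads_needed(target_coverage, mean_read_length_bp, genome_size_bp):
    """Number of reads required to reach a target fold coverage."""
    return math.ceil(target_coverage * genome_size_bp / mean_read_length_bp)


def expected_fraction_covered(coverage):
    """Idealized fraction of bases covered by at least one read.

    Under uniform random sampling, the chance that a base is missed at
    coverage c is exp(-c).
    """
    return 1.0 - math.exp(-coverage)


for c in (5, 8):
    n = reads_needed(c, MEAN_READ_LENGTH_BP, GENOME_SIZE_BP)
    print(f"{c}x coverage: {n} reads, "
          f"expected fraction covered {expected_fraction_covered(c):.4f}")
```

Even at high coverage the idealized model leaves some bases unsampled, which is one reason shotgun assemblies come out as contigs separated by gaps rather than as whole chromosomes.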


Fig. 2. Schematic of genome mapping, genetic and physical, and integration of whole-genome shotgun sequences (WGSS). A typical chromosome (ideogram) is shown at the left of the figure; immediately to its right is a linkage map with units of centimorgans. Genetic markers on the linkage map with sequence information can be used to anchor bacterial artificial chromosomes (BACs) or BAC contigs to the genetic map (integrated map). Individual BACs (third layer from right) are assembled into contigs based on fingerprint information (FPC map). Traditionally, overlapping BACs are used to make a BAC tiling path for sequencing (i.e., yellow BAC clones). End sequences from BACs (BES, orange balls on end of BACs) are used to provide large spanning clones for assembling WGSS and, in the interim, to provide genetic markers to further anchor BAC contigs. A WGSS is made in the form of end-sequenced randomly sheared clones of various sizes (colored bars at right of figure). Paired sequences are computationally assembled into contigs (completely overlapping sequence information) and then scaffolds (ordered contigs where clones span sequence gaps). The BES and sequence-based markers are used to anchor sequence contigs and scaffolds to the BAC-based physical and genetic maps.
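The way paired reads of known spacing join contigs into scaffolds can be sketched as a simple gap estimate. This is an illustrative simplification under stated assumptions, not the assembler used for soybean: every linking clone is assumed to have the same nominal insert size, and the function name and input layout are hypothetical.

```python
from statistics import mean


def estimate_gap(links, insert_size_bp):
    """Estimate the unsequenced gap between two adjacent contigs.

    `links` holds one tuple per clone whose two end reads land in different
    contigs: (bases from the left read to the end of the left contig,
              bases from the start of the right contig to the right read).
    Whatever part of the insert is not accounted for inside the two contigs
    must lie in the gap between them.
    """
    if not links:
        raise ValueError("at least one linking read pair is required")
    return mean(insert_size_bp - left - right for left, right in links)


# Three hypothetical clones from a library with a nominal 10-kb insert.
links = [(4000, 3000), (5000, 2500), (3500, 3000)]
print(estimate_gap(links, insert_size_bp=10_000))
```

Mixing small, intermediate, and BAC-sized inserts gives such estimates at several length scales, so that gaps too large for one library can still be spanned by another.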

 
To benchmark the WGSS approach, several large, megabase-sized, contiguous regions will be sequenced using overlapping BACs from the physical map. These will provide benchmarks to assess how well and at what coverage the WGSS approach is working. Within these large contiguous regions, the duplicated regions will also be targeted for BAC-based sequencing to determine how duplicated sequences will confound the WGSS assembly due to sequence conservation between the homeologous segments (Schlueter et al., 2006). Even though improper assembly of duplicated genome segments that had ≥97% sequence identity was problematic (She et al., 2004), recent improvements in WGSS assembly algorithms have to some extent overcome these limitations. Given what we currently know about the level of sequence conservation in the most recently duplicated regions of the soybean genome, few regions should be misassembled. During the course of the soybean shotgun sequencing, builds will be attempted to determine the level of coverage, representational biases in sequencing and/or cloning, and the difficulties that may arise due to either repetitive DNA sequences or recently duplicated sequences that can confound assembly.
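The ≥97% identity figure cited above can be turned into a simple check on BAC-sequenced homeologous segments. The sketch below assumes the two segments have already been aligned to equal length with `-` marking gaps; the function names are hypothetical, and excluding gapped columns from the denominator is one of several possible conventions.

```python
def percent_identity(aln_a, aln_b):
    """Percent identity over aligned, ungapped columns of two sequences."""
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have equal length")
    compared = matches = 0
    for x, y in zip(aln_a.upper(), aln_b.upper()):
        if x == "-" or y == "-":
            continue  # skip columns where either sequence has a gap
        compared += 1
        if x == y:
            matches += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return 100.0 * matches / compared


def at_risk_of_misassembly(aln_a, aln_b, threshold=97.0):
    """Flag segment pairs at or above the identity level reported as problematic."""
    return percent_identity(aln_a, aln_b) >= threshold


# Toy 100-bp segments: one copy differs at 2 positions, the other at 4.
segment = "ACGT" * 25
recent_copy = "TT" + segment[2:]
older_copy = "TTTTT" + segment[5:]
print(percent_identity(segment, recent_copy), at_risk_of_misassembly(segment, recent_copy))
print(percent_identity(segment, older_copy), at_risk_of_misassembly(segment, older_copy))
```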


    Integration of WGSS with the Physical Framework
 
One challenge in WGSS approaches is that the resulting sequence is often fragmentary and not associated with chromosomes and/or linkage groups. Therefore, the pursuit of a soybean physical map that is anchored to the genetic map will assist in making the WGSS as biologically informative as possible by providing the links to the linkage groups and underlying chromosomes. To integrate the WGSS and the physical map, two sequence-based resources will be used: (i) end-sequenced BAC clones, providing sequence links between the two datasets; and (ii) sequence-based genetic markers placed on the physical map, providing links between the linkage groups, physical map, and sequence map (Fig. 2). The merging of all these datasets should result in a rich resource that plant biologists can use to characterize the soybean genome, clone genes, and enhance soybean breeding.
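The chain of links described here (BES to scaffold, BAC to FPC contig, contig to linkage group) can be sketched as a lookup with a majority vote. This is an illustration under assumptions, not the project's pipeline: the data structures, the function name, and all identifiers in the example are hypothetical, and a real integration would also weigh alignment quality and marker order.

```python
from collections import Counter


def anchor_scaffolds(bes_hits, bac_to_contig, contig_to_linkage_group):
    """Assign each sequence scaffold to a linkage group by majority vote.

    bes_hits: iterable of (bac_name, scaffold_id) pairs, one per BAC end
        sequence that aligns to a scaffold.
    bac_to_contig: BAC name -> FPC contig (from the fingerprint map).
    contig_to_linkage_group: FPC contig -> linkage group (from markers).
    Returns scaffold_id -> (linkage_group, number_of_supporting_hits).
    Scaffolds with no hit to an anchored contig are left out.
    """
    votes = {}
    for bac, scaffold in bes_hits:
        contig = bac_to_contig.get(bac)
        linkage_group = contig_to_linkage_group.get(contig)
        if linkage_group is None:
            continue  # BAC is not in a contig, or contig has no marker
        votes.setdefault(scaffold, Counter())[linkage_group] += 1
    return {s: counts.most_common(1)[0] for s, counts in votes.items()}


# Hypothetical inputs.
bes_hits = [("bacA", "scaffold_1"), ("bacB", "scaffold_1"),
            ("bacC", "scaffold_1"), ("bacD", "scaffold_2")]
bac_to_contig = {"bacA": "ctg1", "bacB": "ctg1", "bacC": "ctg2", "bacD": "ctg3"}
contig_to_linkage_group = {"ctg1": "A1", "ctg2": "B2"}  # ctg3 is unanchored

anchored = anchor_scaffolds(bes_hits, bac_to_contig, contig_to_linkage_group)
print(anchored)
```

Conflicting votes, such as the single hit to a second linkage group for `scaffold_1` in the example, are the kind of signal that could point either to a misassembly or to a duplicated region.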

One remaining question is what level of sequence completion the community desires. A previous community meeting indicated a strong desire to have soybean be the reference genome for Phaseoloid legumes (Gepts et al., 2005). The WGSS approaches may or may not result in what is often considered reference quality. Therefore, some level of completion post-WGSS may be necessary, such as that being done for poplar. The counterargument is this: if most of the genome is captured in the current WGSS approach, should further resources be expended finishing the genome rather than capturing genomic information from other legume crops, or rather than annotating, functionally characterizing, and placing the data into a user-friendly database? Another alternative, based on what we currently understand about genome organization in soybean, is to focus on finishing the euchromatic arms only. We suspect that a majority of the nearly 40% of the genome that is repetitive is compartmentalized in the pericentromeric regions; thus, targeted finishing of the euchromatic arms would cover approximately one-half of the 1100-Mb genome.


    Finishing, Annotation, and Databasing of the Genome Sequence
 
The greatest challenges following the WGSS of soybean will be the finishing of the gene-rich regions, as described above, and the annotation of genes, repeats, and promoter elements. Although well over 350 000 EST sequences are available in GenBank, relatively few full-length cDNA sequences are available. Full-length cDNA sequences have proven to be extremely valuable in the annotation of the Arabidopsis and Oryza genomes (Seki et al., 2002; Borevitz and Ecker, 2004; Yuan et al., 2005). Thus, there needs to be a concerted public effort to produce more high-quality, full-length cDNA sequences.

For the sequence to be of any use, it must be publicly available with associated data [i.e., gene annotation, genetic markers, quantitative trait loci (QTL)]. Since soybean has been selected as a reference genome for the Phaseoloid legumes (Gepts et al., 2005), these data must be present in a cross-legume format such that biologists interested in related legume species can efficiently leverage them. The physical map is already hosted at SoyBase (Grant and Shoemaker, 2006), a community genetic mapping resource specific to soybean. The Legume Information System (LIS) (Gonzales et al., 2005) appears to be the best venue to maintain and update the genome sequence and disseminate it to the broader legume community. The LIS already hosts the Medicago genome sequence and integrates it with other legume-specific (and nonlegume species) sequence data. The LIS is best suited to handle and display these emerging legume genome sequences in a biologically informative manner that will be of use to legume biologists and geneticists.


    The Postgenomic Future of Soybean
 
Although sequencing the soybean genome will be a formidable task, it is only a beginning: a tool to enable studies of soybean biology. As previously outlined (Stacey et al., 2004), soybean has a number of resources available to leverage the genomic sequence. For example, efforts are underway to generate a variety of reverse genetic tools in soybean including TILLING (McCallum et al., 2000), virus-induced gene silencing (Zhang and Ghabrial, 2006), and transposon mutagenesis (T. Clemente, K. Wang, Z. Zhang, R.C. Shoemaker, and G. Stacey, 2006, unpublished data). RNAi-induced gene silencing is also well established in soybean, using both stable transformation, through either biolistics or Agrobacterium tumefaciens (Reddy et al., 2003; Lim et al., 2005; Subramanian et al., 2005; Nunes et al., 2006), and hairy root transformation (mediated by A. rhizogenes; C. Taylor and G. Stacey, 2006, unpublished data). These methods will be greatly aided by knowledge of the genome and ORFeome.

Laboratories are already envisioning multiple uses of the genomic sequence. Among these is the use of modern resequencing methods (Thomas et al., 2006; Velicer et al., 2006) to analyze the genomes of cultivars other than ‘Williams 82’ for studies of gene diversity and for expanded SNP discovery. These approaches will rapidly expand the few thousand SNPs currently available that facilitate gene cloning, haplotyping, and diversity and QTL analyses. No doubt genome tiling arrays (Hazen and Kay, 2003; Mockler et al., 2005) and other more sophisticated approaches to study gene expression (e.g., chromatin immunoprecipitation; Borevitz and Ecker, 2004) will soon follow.

When the soybean genome sequence is available, soybean will be the third legume species, after Medicago truncatula Gaertn. and Lotus japonicus (Regel) K. Larsen (Young et al., 2005), for which a full genome sequence is available. This will place legumes in a special position of having three mostly complete genomes, facilitating phylogenetic and evolutionary studies within this family. This will be particularly advantageous for determining the role of genome duplication (polyploidization) in the evolution and domestication of this important crop. Among the important questions that can be addressed are the unique symbioses that legumes have with N-fixing bacteria, as well as their N-rich lifestyle, both of which are of significant agronomic importance.


    ACKNOWLEDGMENTS
 
We would like to acknowledge the generous funding of the United Soybean Board, National Science Foundation (DBI 0501877), United States Department of Agriculture-Agricultural Research Service, and the United States Department of Energy.


    NOTES
 
Abbreviations: BAC, bacterial artificial chromosome; BES, bacterial artificial chromosome end sequences; EST, expressed sequence tag; FPC, fingerprinted contig; JGI, Joint Genome Institute; LIS, Legume Information System; MY, million years; NSF, National Science Foundation; QTL, quantitative trait locus; RFLP, restriction fragment length polymorphism; SNP, single-nucleotide polymorphism; SSR, simple sequence repeat; STC, sequence tag connector; USB, United Soybean Board; USDA, United States Department of Agriculture; USDOE, United States Department of Energy; WGSS, whole-genome shotgun sequencing.

Received for publication August 8, 2006.

