Crop Science Journal of Natural Resources and Life Sciences Education
Published online 18 December 2007
Published in Crop Sci 47:S-32-S-43 (2007)
© 2007 Crop Science Society of America
677 S. Segoe Rd., Madison, WI 53711 USA

Translational Bioinformatics: At the Interface of Genomics and Quantitative Genetics

William D. Beavis*, Faye D. Schilkey and Susan M. Baxter

National Center for Genome Resources, 2935 Rodeo Park Dr. East, Santa Fe, NM 87505. Funded, in part, by NIH-NIAID HHS200400064C, NSF BDI-0516487, and USDA-ARS SCA 58-3625-2-109

* Corresponding author (wdbeavis{at}agron.iastate.edu).


    ABSTRACT
Genomics and bioinformatics are expected to revolutionize crop improvement. We have to admit, however, that while information from the various "omics" is being used by developmental and evolutionary biologists, the information is not being used routinely by translational plant biologists or applied plant breeders. This is due to a failure to provide information from omics technologies in formats that can be used to develop more effective and efficient assays that the breeder can use for selection. A similar situation exists in biomedical research where more efficacious therapies will be realized if omics information is translated into biomarker-based diagnostics. Because data are still lacking in plant science, herein we describe the development of an integrated web-based system that supports translational research using a biomedical example that can serve as a model for translational plant research. We also discuss how the bioinformatic system is agnostic with respect to data content and is capable of accepting omics data that eventually will be generated by plant biologists for use by plant breeders to develop diagnostic biomarkers for use in selection.

Abbreviations: CGL, candidate genetic loci • CMTV, Comparative Map and Trait Viewer • GEYSIR, Genome Exploration and Survey of Immune Response • KEGG, Kyoto Encyclopedia of Genes and Genomes • LD, linkage disequilibrium • LIS, Legume Information System • NCBI, National Center for Biotechnology Information • NCGR, National Center for Genome Resources • NIAID, National Institute of Allergy and Infectious Disease • NIH, National Institutes of Health • OMIM, Online Mendelian Inheritance in Man • PI, principal investigator • QTL, quantitative trait loci • SNP, single nucleotide polymorphism • SSWAP, Simple Semantic Web Architecture and Protocol • TEAM, A Tool for the Integration of Expression and Linkage in Association Maps • XML, extensible markup language



    ACKNOWLEDGMENTS
 
We wish to express our gratitude to the many members of the immune response population genetics project team and to an anonymous reviewer who provided a number of helpful suggestions.

Received for publication August 7, 2006.



    INTRODUCTION
Plant breeding is about selection: selection of crosses and selection among their progeny using (multiple) heritable assays. So what does the "omics" revolution and bioinformatics have to offer plant breeders (Thro et al., 2004)? To enhance plant breeding, various omics resources need to be translated into heritable assays that are more effective and efficient than phenotypic assays currently employed by plant breeders. Such translational research in plant biology has yet to be developed.

Until recently translational biomedical research was similarly lacking, that is, omics data and bioinformatics were not producing efficient, effective diagnostics and subsequent therapies. Thus, medical clinicians have not used emerging genomics-based knowledge in bedside practice. Beginning in 2003, the National Institute of Allergy and Infectious Disease (NIAID), one of the National Institutes of Health (NIH), began to address the gap between basic and applied medical research by funding efforts to translate structural and functional genomic information into useful diagnostics. Herein we present a translational biomedical research project that is focused on development of DNA-based biomarker diagnostics, describe how translational bioinformatics can provide the necessary tools to identify these diagnostic biomarkers, and discuss how such an approach can be used for translational plant science.

When the human genome sequencing effort began, there was considerable debate within the biomedical research community about whether this fundamental information would enable the NIH to meet its mission of developing useful diagnostics and therapeutics for human health (Baxevanis, 2001). While the completion of a canonical human genome sequence has not directly created diagnostic or therapeutic applications, it is recognized that the complete genome sequence, coupled with automated annotation based on powerful gene modeling algorithms, has greatly enhanced the ability to locate candidate genetic loci (CGL). Subsequent map-based cloning and resequencing of alleles at CGL from affected and unaffected individuals (i.e., association genetics) has revealed the allelic basis of over 1200 simple Mendelian diseases (Botstein and Risch, 2003). Further bioinformatic analyses have revealed that over 80% of these diseases result from deletions or insertions and missense mutations in exons. These mutations in the translated DNA sequence have enabled development of DNA-based diagnostic tests for these simply inherited diseases, even though therapies for most still await better understanding of the molecular mechanisms that link genotypes to phenotypes.

As with the human genome, the full genome sequence from Arabidopsis and "gene-space" sequences from reference plant species such as rice (Oryza sativa L.), medicago (Medicago truncatula Gaertner), poplar (Populus trichocarpa L.), maize (Zea mays L.), and soybean [Glycine max (L.) Merr.] have been combined with gene modeling algorithms to annotate the sequences. Functional annotation is proceeding with both computational methods (Kriventseva et al., 2005) and through experimental approaches (http://www.nsf.gov/pubs/2005/nsf05624/nsf05624.htm; verified 15 Nov. 2007). The advantage of functional genomic studies in Arabidopsis and reference plant species relative to humans is that mutations can be induced and evaluated for major Mendelian phenotypes. The disadvantage is that these induced mutants cannot be used directly for diagnostic tests in crop species, unless the crop species is also a reference species (e.g., rice). Nonetheless, bioinformaticists have leveraged this information to develop comparative genomics analyses based on homology among plant species. These methods have been incorporated into automated structural and functional annotations for genes in crop species (e.g., see the TIGR Gene Indices, http://www.tigr.org/tdb/tgi/plant.shtml; verified 15 Nov. 2007). Identification and structural annotation of sequences based on gene models is still not sufficient, however, for use in selection of complex and quantitative traits because the breeder needs the sequence for the allelic variant associated with the desirable trait to develop a high throughput, heritable assay.

The successful identification of allelic variants responsible for simple Mendelian traits prompted biomedical researchers to evaluate the use of association genetics to identify the allelic basis of variability in complex and quantitative traits. Ideally, they would like to do this on a whole genome basis because of the inherent bias in the candidate gene approach. Unfortunately, reasonable power to find allelic associations with diseases of moderate relative risk could require assays of millions of common single nucleotide polymorphisms (SNPs) on thousands of individuals (Botstein and Risch, 2003). Even the considerable budgets of NIH cannot support such studies with current technologies. As an alternative, biomedical researchers have pursued association genetic studies of complex diseases using prior knowledge about CGL. The candidate gene approach has been successful in identifying allelic variants associated with a large number of complex traits including asthma, Alzheimer disease, breast cancer, stroke, and hyperlipidemia (e.g., see Chasman et al., 2004; Drysdale et al., 2000; Gretarsdottir et al., 2003; Judson et al., 2004; Lohrisch and Piccart, 2000, 2001a, 2001b; Poirier et al., 1995; Rogaeva et al., 2007; Steinthorsdottir et al., 2007; Winkelmann et al., 2003).

The general candidate gene approach consists of (i) identification of CGL, followed by (ii) an association genetics study, and (iii) validation of any statistically significant associations. For purposes of this manuscript we will focus on translational bioinformatics to facilitate identification of CGL. To identify CGL the translational researcher needs to integrate information derived from different types of data from unrelated experiments. At the very least this could require literature studies in genetics, molecular biology, biochemistry, and physiology. With the emergence of numerous high throughput biotechnologies (omics data), the researcher is also faced with the need to aggregate and integrate data from multiple databases and websites. For plant species additional data based on syntenic relationships among genetic maps may be necessary to leverage sequence information from reference species to the crop species of interest (Gonzales et al., 2007). From a translational research perspective, the data and information must be presented and communicated in an understandable format for the translational researcher interested in subsequent development of DNA-based assays. Given the familiarity and ubiquitous use of web-based browsers by translational researchers, a further requirement is for user interfaces to be implemented in a system easily accessed with standard web-based browsers.

Since the necessary omics data resources in crop species are still being developed, we will illustrate these points with a biomedical project: a software system that we recently developed for biomedical researchers interested in identification of CGL for development of DNA-based diagnostics in human infectious diseases.

Background of a Translational Research Project
During the last 10 to 15 yr, a great deal has been learned about the human immune system. From the literature, we are aware of about 1000 genes that are likely involved in immune and allergic responses and we know many of the genetic networks and pathways through which these genes operate (Hunter and Reiner, 2000; Harty et al., 2000; Takeda et al., 2003). For this particular project we have been studying the relationship between the innate and adaptive immune systems. Specifically, we are interested in polymorphisms in the Toll receptor genes of dendritic cells and their impact on signal transduction pathways affecting T cells and their subsequent Th1 and Th2 responses to infection (see Nishimura [2001] for these pathways). Effective vaccines depend on proper functioning of this system to produce a Th1 cell response. Th1 cells are "programmed" to recognize protein epitopes from the vaccine. Once programmed, Th1 cells produce pro-inflammatory cytokines that stimulate destruction of microbial pathogens. In some individuals, a vaccine will elicit a Th2 cell response, causing production of a class of cytokines that cause hyperallergenic and inflammatory reactions that can lead to severe disease and death. Whether or not genetic variants are associated with these different cellular responses and the corresponding clinical syndromes is currently unknown.

In 2004 the NIAID requested proposals to develop diagnostic biomarkers that are predictive of unfavorable immunological responses to infections and vaccines. Twenty-five years ago, smallpox vaccines were given routinely and as many as one in a million vaccinations resulted in death and at least 1 in 50 thousand vaccinations resulted in adverse reactions (Henderson, 1996). Since 2001, there has been serious debate about whether society will accept such high rates of negative side effects should the smallpox vaccine have to be readministered (Lane and Goldstein, 2003). Additionally, vaccines for influenza can cause severe allergic reactions that occasionally result in death. To address this challenge the National Center for Genome Resources (NCGR) partnered with deCODE Genetics (Reykjavik, Iceland) and the University of New Mexico Health Sciences Center (Albuquerque) to develop DNA-based biomarkers that are predictive of unfavorable responses to smallpox and influenza vaccines. An applied clinical goal of the project is a rapid biomarker assay that can be used to advise people seeking vaccination about their relative risk for an unfavorable reaction.

Although the specific genes and phenotypes of this project are not relevant to plant breeders, the higher level applied goal is. For plant breeders to take advantage of omics information, they will need "field kits" consisting of DNA-based biomarkers that will enable them to rapidly predict the response of varieties to treatments, whether those are herbicide treatments, pathogen attacks, or drought. As with the human immune response system, almost all plant responses are complex and polygenic; thus the candidate gene approach to identifying DNA-based biomarkers associated with complex traits will be the same.


    Methods and Materials
Requirements for a software system to identify CGL based on multiple data types from multiple sources of information were based on the following use-case: A researcher would like to identify potential CGL based on (i) their genomic alignment with likelihood surfaces from multiple linkage disequilibrium (LD) studies, (ii) structural and functional annotations, and (iii) all known allelic variants within the introns and exons of the CGL. Note that the results of this use-case will provide the translational researcher with information needed to decide which potential allelic variants should be used in subsequent association genetics tests. A further requirement from the translational researchers, clinicians, and NIH included the ability to present information in a single view using a dynamic interactive user-interface easily accessible by commonly used web browsers.

Software Development and Project Management
The Agile project management process (Schwaber, 2004) was used for biweekly software releases. This included iteration planning, sometimes referred to as sprint planning, involving scientists and software developers and daily 15-min status meetings, or Scrums, by the software development team. The technical aspects of software system architecture and software practices are described in the Appendix.

Data and Information Sources
During the iterative planning process the project team identified multiple sources of data and information involving genetic factors related to human immune responses. An immune response candidate gene list, based on immune response literature, was curated and reviewed by the project principal investigators (PIs). Genetic maps and microsatellite marker sets were contributed by collaborators at deCODE Genetics (Kong et al., 2002). Human chromosome sequence maps, genes, and SNPs were retrieved from the National Center for Biotechnology Information's (NCBI's) RefSeq project Build 35 v.1: chromosome sequence maps from the Nucleotide db (http://www.ncbi.nlm.nih.gov/sites/entrez?db=nuccore/ verified 22 Oct. 2007), genes from EntrezGene (http://www.ncbi.nlm.nih.gov/sites/entrez?db=gene/ verified 22 Oct. 2007), and sequence variants from dbSNP (http://www.ncbi.nlm.nih.gov/sites/entrez?db=snp/ verified 22 Oct. 2007). Supporting functional information is hyperlinked: literature citations to PubMed and Online Mendelian Inheritance in Man (OMIM), and pathway information to the Kyoto Encyclopedia of Genes and Genomes (KEGG) (Kanehisa et al., 2006) and BioCarta (http://www.biocarta.com/genes/allpathways.asp/ verified 22 Oct. 2007).

In addition to pre-existing data and information, the project is generating its own experimental data. For example, we will generate likelihood statistics on a whole genome basis from LD studies, on a haplotype basis from association genetics studies, and predicted phenotypes from dendritic cell–based expression arrays from independent samples in validation experiments.

We used the extensible markup language (XML), a technology-neutral data interchange format, for data transfer and management. XML was chosen because parsing engines for it are ubiquitous. What constitutes valid XML, and what the XML should look like, is explicitly declared using document type definitions.
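As a rough illustration of this kind of XML interchange, the sketch below parses a small gene record with Python's standard library. The element and attribute names (`gene`, `snp`, `position`, `allele`), the coordinates, and the SNP identifiers are hypothetical and are not taken from the GEYSIR schema. Note also that `xml.etree.ElementTree` checks only that a document is well formed; validating it against a document type definition would require a validating parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical record: element names, coordinates, and SNP ids are
# illustrative assumptions, not the project's actual XML format.
RECORD = """\
<gene symbol="IL12RB2" chromosome="1">
  <snp id="snp_a" position="1200" allele="A/G"/>
  <snp id="snp_b" position="4800" allele="C/T"/>
</gene>
"""

def parse_gene(xml_text):
    """Return (symbol, chromosome, [(snp_id, position, alleles), ...])."""
    root = ET.fromstring(xml_text)
    snps = [
        (s.get("id"), int(s.get("position")), s.get("allele"))
        for s in root.findall("snp")
    ]
    return root.get("symbol"), root.get("chromosome"), snps

symbol, chromosome, snps = parse_gene(RECORD)
```

Because the record is plain text, the same file could be produced by one component and consumed by another written in a different language, which is the sense in which the format is technology neutral.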


    Results
The product of the Agile process (described above) is a designed, developed, and implemented software system named Genome Exploration and Survey of Immune Response (GEYSIR; http://geysir.ncgr.org/ verified 22 Oct. 2007). Because of the rapid software project management and iteration process used in its development, GEYSIR became functional and operational within 6 mo of its inception. For example, project PIs have been able to use the system to design association genetic studies and validation tests through identification of CGL, their functional and structural annotations, and associated SNPs.

Use-Case Illustration
Upon entry to GEYSIR it is possible to observe any and all likelihood plots across the whole genome, and these can be compared among multiple studies (Fig. 1a). The ability to compare results from multiple studies and across the entire genome in a single view is enabled by converting the likelihood surface into a "heat map" in which the researcher has the ability to assign colors to the likelihood values generated by the statistical analyses. This capability was originally developed for translational researchers working with plant breeders at CGIAR centers through the Comparative Map and Trait Viewer (CMTV) system (Sawkins et al., 2004).
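The conversion of a likelihood surface into a heat map can be sketched as a mapping from each likelihood value to a color. The example below is a minimal sketch that assumes a simple linear blend between two user-chosen RGB colors; the function names, the blue-to-red scale, and the sample scores are illustrative assumptions and do not describe how GEYSIR or CMTV actually assign colors.

```python
def likelihood_to_rgb(value, low, high, cold=(0, 0, 255), hot=(255, 0, 0)):
    """Linearly interpolate an RGB color for a likelihood value in [low, high]."""
    if high <= low:
        raise ValueError("high must be greater than low")
    # Clamp so values outside the chosen range take the end colors.
    t = min(max((value - low) / (high - low), 0.0), 1.0)
    return tuple(round(c + t * (h - c)) for c, h in zip(cold, hot))

def heat_map(values, low, high):
    """Convert a sequence of likelihood values into a list of RGB tuples."""
    return [likelihood_to_rgb(v, low, high) for v in values]

# Made-up likelihood scores along a chromosome (illustrative only).
scores = [0.0, 1.5, 3.0, 4.5]
colors = heat_map(scores, low=0.0, high=3.0)
```

Using the same `low` and `high` for every study is what would make colors comparable across studies displayed side by side.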


Figure 1. Screen shots from GEYSIR (a, b, d) and BioCarta (c) illustrating the ability to identify candidate genetic loci. Likelihood surfaces, illustrated as "heat maps," of the 22 human chromosomes from two retrospective linkage disequilibrium (LD) studies of response to vaccines or infections in a sample of families from the Icelandic population. Selection of chromosome one results in a horizontal heat map of chromosome 1. This is depicted as a heat map from one of the LD studies and is shown in the top "track." Vertical hatch marks on linkage and physical maps depict mapped simple sequence repeat (SSR) markers in the second track. Also depicted are representations of the location of all annotated genes in the third track. The location of these genes on the LD track and linkage and physical maps are shown with the "slider" bars that are used for navigation across the chromosome. Candidate genetic loci known to be associated with immunological responses are color coded (blue). Selection of one candidate genetic locus (CGL), IL12Rβ2, with a "control click" results in a BioCarta view of the supporting information from the literature and known pathways in which the gene participates. Currently these literature citations are hyperlinked to Online Mendelian Inheritance in Man (OMIM) and Medline, while pathways are hyperlinked to Kyoto Encyclopedia of Genes and Genomes (KEGG) and BioCarta. A view of the structural annotation of the gene IL12Rβ2 and the positions of allelic variants, currently in the form of single nucleotide polymorphisms (SNPs), are displayed relative to the structural annotation. A slider placed over the annotation and SNP tracks enables a view of the canonical sequence with known sequence variants displayed as "bubbles."

 
With a single left-click it is possible to observe the likelihood surface of a single linkage group and its associated genetic and physical maps (Fig. 1b). Thus, it is possible to observe the genomic regions associated with high likelihood values (quantitative trait loci [QTL] peaks) and ascertain whether the QTL occur in regions with large amounts of DNA per centimorgan of recombination. Further, the likelihood surface and genetic and physical maps have a graphical, user-driven "slider," similar to that found on slide rules. By moving the slider with the mouse it is possible to view all genes that have been mapped to the physical location associated with the QTL. Genes with published immune responses are color coded (blue) for quick identification. In this particular example we began with a likelihood peak on chromosome 1 and identified the gene for the beta 2 subunit of the interleukin 12 receptor, that is, IL12Rβ2. IL12 induces the expression of IFNγ through the JAK/STAT cell signaling pathway (Fig. 2) (Trinchieri et al., 2003), and its receptor, IL12Rβ2, is one of the candidate genes under the likelihood peak. With a "control click" on IL12Rβ2, or any gene associated with immune responses, it is possible to view the supporting information from the literature and observe the known pathways in which the gene participates (Fig. 1c).
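The step from a likelihood peak to the candidate genes beneath it can be sketched as an interval query. The example below is only a sketch under simplifying assumptions: it treats all positions at or above a threshold as one contiguous peak, and every position, score, threshold, and gene coordinate (including those given for IL12RB2) is made up for illustration. The function names are hypothetical.

```python
def peak_interval(positions, scores, threshold):
    """Return (start, end) spanning all positions scoring >= threshold, or None."""
    hits = [p for p, s in zip(positions, scores) if s >= threshold]
    return (min(hits), max(hits)) if hits else None

def genes_in_interval(genes, interval):
    """Return genes whose start-end coordinates overlap the interval."""
    lo, hi = interval
    return [g for g in genes if g["start"] <= hi and g["end"] >= lo]

# Hypothetical marker positions, likelihood scores, and gene coordinates.
positions = [10, 20, 30, 40, 50]
scores = [0.5, 1.2, 3.4, 3.1, 0.8]
genes = [
    {"symbol": "GENE_A", "start": 12, "end": 14, "immune": False},
    {"symbol": "IL12RB2", "start": 33, "end": 35, "immune": True},
    {"symbol": "GENE_B", "start": 38, "end": 42, "immune": False},
]

interval = peak_interval(positions, scores, threshold=3.0)
under_peak = genes_in_interval(genes, interval)
# Keep only genes flagged in the curated immune-response list.
candidates = [g["symbol"] for g in under_peak if g["immune"]]
```

The `immune` flag plays the role of the blue color coding: it separates curated immune response genes from the other genes that happen to lie under the peak.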


Figure 2. Screen shot from BioCarta (http://www.biocarta.com/pathfiles/h_dcPathway.asp/ verified 22 Oct. 2007) illustrating dendritic cell signaling of Th1 and Th2 responses to infection. Note, the IL12-mediated response of the Th1 cell.

 
A single click on IL12Rβ2, or any gene of interest, will enable a view of the structural annotation of the gene (Fig. 1d). Also, the positions of allelic variants, usually in the form of SNPs, are displayed relative to the structural annotation. A slider placed over the annotation and SNP "tracks" enables a view of the canonical sequence with known sequence variants displayed as "bubbles." It is thus possible for the researcher to quickly "drill" down to CGL associated with a QTL peak, identify all known allelic variants in exons, introns, and untranslated regions of the gene, and learn whether these occur in "hot spots" of recombination. Further, it is possible for the researcher to select allelic variants of CGL for the design of primers to be used in an association genetics test.
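Placing variants relative to a structural annotation amounts to comparing each variant position with the gene's exon coordinates. The sketch below assumes a hypothetical gene model with made-up coordinates and SNP names; for brevity it distinguishes only exons, introns, and positions outside the gene, and it does not separate untranslated regions from coding exons.

```python
def classify_variant(position, gene_start, gene_end, exons):
    """Label a variant position as 'exon', 'intron', or 'outside' the gene."""
    if not gene_start <= position <= gene_end:
        return "outside"
    for start, end in exons:
        if start <= position <= end:
            return "exon"
    # Inside the gene but in no exon, so it falls in an intron.
    return "intron"

# Hypothetical gene model and SNP positions (illustrative only).
GENE_START, GENE_END = 1000, 5000
EXONS = [(1000, 1500), (2500, 3000), (4500, 5000)]
snp_positions = {"snp_a": 1200, "snp_b": 2000, "snp_c": 4800, "snp_d": 6000}

labels = {
    name: classify_variant(pos, GENE_START, GENE_END, EXONS)
    for name, pos in snp_positions.items()
}
```

A researcher choosing variants for an association test could filter on these labels, for example keeping only the exonic SNPs.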

A variant on this use-case is to enter the system with a gene of interest, wanting to know whether it is involved in immunologic response networks and whether it is implicated in project LD studies, and to identify the various alleles for use in association studies. Once signed onto GEYSIR it is immediately possible to query the system for all curated genes involved with immune responses (Fig. 3). Selecting the gene of interest, for example, IL12Rβ2, generates a view of the location of the gene in the genome (Fig. 1b) with its associated genetic and physical maps. Subsequent steps of viewing the gene in the context of its pathways and identification of SNPs in the context of its structural annotation and sequence are the same as presented in Fig. 1c and 1d.


Figure 3. Screen shot taken from GEYSIR depicting the ability of the researcher to enter the system from the perspective of selecting a gene of interest. In this case the researcher has asked for all genes containing the characters "IL12" and subsequently chooses IL12Rβ2 for further information (see Fig. 2).

 
Additional Implementation Notes
Researchers interested in using the GEYSIR website will note that the Flash windows have a fixed size. This is not a limitation of Flash; rather, it is due to our implementation of the software and can be easily changed in the source code. The researcher will also note that the right click of a typical mouse does not work. Again, this is an implementation decision that reflects a user community that works with Macintosh computers; a CTRL-click should be used instead of the right click.


    Discussion
Initially, the term bioinformatics was used to describe the processes of acquiring and analyzing sequence data from proteins and DNA molecules (Mount, 2001). As various omics technologies proved capable of obtaining data from whole genomes (Fleischmann et al., 1995), bioinformatics emerged as a discipline concerned not only with acquisition, assembly, and annotation of genome sequences, but also with all aspects of automated data collection, management, integration, analysis, and dissemination of data and information (Beavis, 2005). These activities have been enabled through development of imaging technologies, relational database management systems, novel analysis methods, the Internet, and World Wide Web technologies (Baxevanis, 2001).

From the biologist's perspective, one of the more important activities of bioinformatics is to aggregate and integrate disparate data and information resources into a single system. There are numerous approaches to integration (Siepel et al., 2001a, 2001b) and many have been used to develop plant information databases. As the model species for understanding plant developmental biology, Arabidopsis thaliana has been used to generate enormous amounts of omics data through the 2010 project (http://www.nsf.gov/pubs/2005/nsf05624/nsf05624.htm). The Arabidopsis Information Resource (TAIR, http://www.Arabidopsis.org/ verified 22 Oct. 2007) is the primary and most comprehensive plant information system with genetic, genomic, transcript (expressed sequence tag and microarray), pathway, and metabolomic data. MaizeGDB (http://www.maizegdb.org/ verified 22 Oct. 2007) provides a similar, although less comprehensive, resource for developmental biologists and geneticists working on maize. On the other hand, plant evolutionary biologists are served by systems that integrate a single data type, across multiple species. For example, PlantGDB (http://www.plantgdb.org/ verified 22 Oct. 2007) integrates DNA sequence data for purposes of supporting comparative genomics. And, a few plant information systems integrate information across both species and data-types in what are known as clade-oriented databases (Stein et al., 2006) such as Gramene (http://www.gramene.org/ verified 22 Oct. 2007) and the Legume Information System (LIS; http://www.comparative-legumes.org/ verified 22 Oct. 2007).

While these integrated systems are providing significant support for developmental and evolutionary plant biologists, they have not been designed for or used by translational plant researchers. Thus, omics information and knowledge are not being used by applied plant breeders. There are several reasons for this lack of translation including:

  1. Identification of genetic variants on a whole genome basis has not been cost effective for crop species, although with the emergence of second generation sequencing technologies this will change.
  2. Information from multiple sources has not been integrated and presented in a manner that is useful for the translational researcher.

Indeed, most of the data and information used by the biomedical project that motivated GEYSIR will come from publicly available websites; <15% will be generated by the project. Because it is difficult for the biomedical researcher to collect, manage, analyze, and interpret all of this data, especially when it is distributed among websites and project databases, the NIH has recognized the need for translational bioinformatics. Like most biomedical researchers, translational plant researchers will not have the skills and time to translate existing information into effective and practical knowledge if the information is not aggregated, integrated, and provided through intuitive interfaces.

We are aware of two previous projects that have addressed the issues of aggregating, integrating, and presenting omics data in a format useful for translational research: Comparative Map and Trait Viewer (CMTV) (Sawkins et al., 2004) and A Tool for the Integration of Expression and Linkage in Association Maps (TEAM) (Franke et al., 2004). While CMTV is extremely flexible and very powerful at integrating disparate sources of information, it is a system that was developed for purposes of exploration and discovery. Thus, it did not meet the requirement of presenting information in a single view with commonly used web browsers. TEAM was designed to address translational research goals and met many of the requirements of our biomedical research project. However, it did not meet the necessary requirement of presenting information in a single view using a dynamic interactive user-interface easily accessible by commonly used web browsers.

It should be emphasized that our current implementation of GEYSIR does not aggregate all data into a single database system. Instead, it relies on links to several external sources of information, for example, OMIM, KEGG, and BioCarta, and it is well known that publicly funded web resources can be volatile. Because of our close association with BioMOBY services (Wilkinson and Links, 2002), Semantic MOBY (http://biomoby.open-bio.org/index.php/semantic-moby/ verified 22 Oct. 2007), and the recently developed Simple Semantic Web Architecture and Protocol (SSWAP) (Gessler, 2008), a fusion of semantic BioMOBY and BioMOBY services, we think GEYSIR will be deployed more effectively using SSWAP once these technologies mature.

The GEYSIR system illustrates that it is possible to rapidly design, develop, and implement a system to help the biomedical researcher mine multiple information resources through a single dynamic web-based interface. Equally compelling, the GEYSIR system became functional while it was still being developed. Indeed, it continues to be developed for additional use-cases while it is being used by project PIs.

It should be re-emphasized that the described use case and its variants will provide the translational researcher with information to identify potential allelic variants for use in association genetics tests. However, this is not sufficient for the applied practitioner interested in using DNA-based diagnostic markers. To meet this ultimate goal, the results of association genetics and validation tests will need to be added to and integrated within the GEYSIR system. For example, we have defined a next use-case: A researcher would like to enter the system through knowledge of signal transduction or cell signaling pathways, observe the results of a microarray experiment superimposed on the pathway, observe the genomic locations of the genes involved in the pathway, and identify the allelic variants of those genes. To address this use-case we will incorporate information from pathways and data from gene expression assays into the system. This will allow the Flash interface to dynamically display results from microarrays on linkage maps and known pathways. Such functional viewers will be integrated with the existing viewers to accommodate selection of allelic variants at CGL for the design of primers (tagSNPs) to be used in the subsequent association genetic and validation studies.
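The data integration implied by this next use-case can be illustrated with a minimal Python sketch. It is not GEYSIR code: the `Gene` and `Variant` records, the `pathway_view` function, and all gene symbols, coordinates, and expression values are hypothetical, and the sketch assumes that pathway membership, gene locations, expression ratios, and variants are already available as simple in-memory collections.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Gene:
    symbol: str
    chromosome: str
    start: int  # first base of the gene, inclusive
    end: int    # last base of the gene, inclusive


@dataclass(frozen=True)
class Variant:
    variant_id: str
    chromosome: str
    position: int


def variants_in_gene(gene: Gene, variants: List[Variant]) -> List[str]:
    """Return the IDs of variants that fall within the gene's coordinates."""
    return [
        v.variant_id
        for v in variants
        if v.chromosome == gene.chromosome and gene.start <= v.position <= gene.end
    ]


def pathway_view(
    pathway_genes: List[str],
    genes: Dict[str, Gene],
    expression: Dict[str, float],
    variants: List[Variant],
) -> List[dict]:
    """Join pathway members with location, expression, and variant data."""
    rows = []
    for symbol in pathway_genes:
        gene = genes.get(symbol)
        if gene is None:
            continue  # pathway member with no mapped genomic location
        expr: Optional[float] = expression.get(symbol)  # None if not assayed
        rows.append(
            {
                "gene": symbol,
                "location": (gene.chromosome, gene.start, gene.end),
                "log2_ratio": expr,
                "variants": variants_in_gene(gene, variants),
            }
        )
    return rows


# Hypothetical example data (not taken from the project described here).
genes = {
    "GENE_A": Gene("GENE_A", "chr1", 1000, 2000),
    "GENE_B": Gene("GENE_B", "chr2", 500, 900),
}
expression = {"GENE_A": 1.8}
variants = [
    Variant("snp_1", "chr1", 1500),
    Variant("snp_2", "chr1", 2500),
    Variant("snp_3", "chr2", 700),
]
view = pathway_view(["GENE_A", "GENE_B", "GENE_C"], genes, expression, variants)
for row in view:
    print(row)
```

Each row of such a view carries what a pathway-oriented viewer would need to display for one gene: where it lies, how it was expressed, and which allelic variants are candidates for tagSNP design.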

As noted in the Methods and Materials, we imported multiple sources of data into GEYSIR (Fig. 4). However, due to its generic n-tiered architecture, GEYSIR is agnostic with respect to the specific data and information resources that are displayed through the Flash interface. Thus, the system will operate with plant genomic and QTL information, and it can be deployed for use by translational plant researchers. Because the full complement of omics information available to biomedical researchers is not yet available for any crop species, additional information about syntenic relationships between the crop species and a reference species will need to be integrated into the system before it can be used effectively by plant biologists. Gonzales et al. (2007) illustrated how such information from M. truncatula can be leveraged for identification of a CGL in soybean. Important information resources needed for a plant version of GEYSIR (Fig. 5) include:

QTL maps from the crop species.
A trait nomenclature and ontology that transcends plant species.
Syntenic relationships between crop species of interest and a reference species with genomic information.
A genetic polymorphism nomenclature and ontology that transcends plant species.
Physical maps from the reference species.
Annotated and curated gene structures in the reference species.
Resequenced CGL using germplasm from the crop species of interest.
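How several of these resources would combine can be sketched in Python. The sketch is illustrative only and assumes a deliberately coarse model: a QTL interval on a crop linkage group is projected through whole syntenic blocks onto a reference genome, and any annotated reference gene overlapping a matched block is reported as a candidate. The `SyntenyBlock` and `RefGene` records, the `candidate_genes` function, and all names and coordinates are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class SyntenyBlock:
    crop_group: str    # linkage group in the crop species
    crop_start: float  # genetic position, cM
    crop_end: float
    ref_chr: str       # chromosome in the reference species
    ref_start: int     # physical position, bp
    ref_end: int


@dataclass(frozen=True)
class RefGene:
    name: str
    chromosome: str
    start: int
    end: int


def candidate_genes(
    qtl_group: str,
    qtl_start: float,
    qtl_end: float,
    blocks: List[SyntenyBlock],
    ref_genes: List[RefGene],
) -> List[str]:
    """Return reference genes in syntenic blocks that overlap the QTL interval."""
    hits: List[str] = []
    for block in blocks:
        if block.crop_group != qtl_group:
            continue
        # Skip blocks that do not overlap the QTL interval on the crop map.
        if block.crop_end < qtl_start or block.crop_start > qtl_end:
            continue
        for gene in ref_genes:
            overlaps = (
                gene.chromosome == block.ref_chr
                and gene.start <= block.ref_end
                and gene.end >= block.ref_start
            )
            if overlaps and gene.name not in hits:
                hits.append(gene.name)
    return hits


# Hypothetical example data.
blocks = [
    SyntenyBlock("LG1", 5.0, 25.0, "ref_chr3", 100_000, 200_000),
    SyntenyBlock("LG2", 0.0, 30.0, "ref_chr5", 1, 50_000),
]
ref_genes = [
    RefGene("ref_gene_1", "ref_chr3", 120_000, 125_000),
    RefGene("ref_gene_2", "ref_chr3", 190_000, 210_000),
    RefGene("ref_gene_3", "ref_chr3", 300_000, 305_000),
    RefGene("ref_gene_4", "ref_chr5", 1_000, 2_000),
]
print(candidate_genes("LG1", 10.0, 20.0, blocks, ref_genes))
```

A practical system would need finer resolution than whole blocks, for example by interpolating between shared markers, and would rely on the trait and polymorphism ontologies listed above to make records comparable across species.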



 
Figure 4. Sources of data and information that provide the current content for the GEYSIR system.

 


 
Figure 5. Sources of data and information that would be needed by the GEYSIR system to support applied plant biologists.

 
Many of these data and information resources are being assembled in clade-oriented databases such as LIS (http://www.comparative-legumes.org/) and Gramene (http://www.gramene.org/). Plant trait and polymorphism ontology efforts have been initiated, but the ontologies are not yet available. Resequencing of CGL in crop species has begun or has been proposed in maize (Yu and Buckler, 2006; S. Moose, personal communication, 2006; P. Schnable, personal communication, 2006) and soybean (Zhu et al., 2003), but these efforts are not yet producing comprehensive polymorphism data on a genome scale. The emergence of high throughput resequencing technologies will accelerate such developments (Leamon et al., 2003). Soon it will be possible to conduct association genetics studies using the candidate gene approach in crops such as maize (Yu and Buckler, 2006).


    Conclusions
 
We have demonstrated that it is possible to integrate and present information produced by omics technologies in a format that is useful for translational research purposes. The ultimate goal of this effort in translational bioinformatics is to produce sufficient information to support development of diagnostic biomarkers that can be used for population genetics research and crop improvement. Because the design and architecture of the GEYSIR software system are agnostic with respect to the sources of data, it will be possible for plant breeders to use the system with data obtained from applied research and the clade-oriented databases. The next challenge will be to use this information to develop allele-specific information that will enable plant breeders to develop biomarkers for use in selection.


    APPENDIX
 
Description of software system architecture and software practices
Software System Architecture
We designed the system with a classic three-tier architecture consisting of a client side user interface, based on Flash (http://www.macromedia.com/software/flash/flashpro/), a web server, and a database server. The result of this architecture is a web-based software application, described in the Results and accessed over the Internet (http://geysir.ncgr.org/).

Usually web-based tools for exploring genomic data are statically rendered HTML pages, which lack live interactivity and are often cumbersome for scientific discovery. To address these issues, we developed an interactive and responsive Flash application enabling the exploration of a wide scale of genomic data in a single genetics-based view that can include data spanning all chromosomes, results of LD and association tests, marker sets, gene neighborhoods, and SNPs. The Flash client is built using ActionScript 2.0 and targeted for the Flash 7 or greater plug-in.

The web server tier consists of Apache HTTP Server v. 2.0.54 (http://httpd.apache.org/ verified 22 Oct. 2007) and Apache Tomcat Servlet engine v. 5.59 (http://tomcat.apache.org/ verified 22 Oct. 2007). The web server application is written in the Java language, including Java 2 Platform Enterprise Edition (J2EE) technologies such as Servlets, Java Database Connectivity (JDBC), and Java Server Pages (JSP). The application was designed using the Struts web application framework (http://struts.apache.org/ verified 22 Oct. 2007).

The database server is a Sybase Adaptive Server Enterprise (http://www.sybase.com/products/informationmanagement/adaptiveserverenterprise/ verified 22 Oct. 2007). The database schema is based on the GMOD database schema known as Chado (http://www.gmod.org/ verified 22 Oct. 2007). The Chado schema enables flexibility through its inherent reliance on controlled vocabularies and ontologies to define data types in the database. We implemented Chado with the Sequence Ontology Feature Annotation (SOFA) subset of the Sequence Ontology (Eilbeck et al., 2005) to define the features that can be directly located on a biological sequence, including genes and SNPs.
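The way an ontology-typed schema keeps the database agnostic to data content can be sketched with a small Python example. It is a simplified illustration, not the GEYSIR schema: it uses an in-memory SQLite database rather than Sybase, keeps only a few columns from the Chado `cvterm`, `feature`, and `featureloc` tables, and loads hypothetical features. The `snps_in_gene` helper is likewise an assumption made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Heavily reduced versions of three Chado tables. Feature types are not
# hard-coded as columns or tables; each feature points at a vocabulary term.
conn.executescript(
    """
    CREATE TABLE cvterm (
        cvterm_id INTEGER PRIMARY KEY,
        name      TEXT NOT NULL UNIQUE
    );
    CREATE TABLE feature (
        feature_id INTEGER PRIMARY KEY,
        uniquename TEXT NOT NULL,
        type_id    INTEGER NOT NULL REFERENCES cvterm (cvterm_id)
    );
    CREATE TABLE featureloc (
        featureloc_id INTEGER PRIMARY KEY,
        feature_id    INTEGER NOT NULL REFERENCES feature (feature_id),
        srcfeature_id INTEGER NOT NULL REFERENCES feature (feature_id),
        fmin          INTEGER NOT NULL,  -- interbase start coordinate
        fmax          INTEGER NOT NULL   -- interbase end coordinate
    );
    """
)

# Hypothetical content: one chromosome, one gene, two SNPs.
conn.executemany(
    "INSERT INTO cvterm VALUES (?, ?)",
    [(1, "chromosome"), (2, "gene"), (3, "SNP")],
)
conn.executemany(
    "INSERT INTO feature VALUES (?, ?, ?)",
    [(1, "chr1", 1), (2, "gene_A", 2), (3, "snp_1", 3), (4, "snp_2", 3)],
)
conn.executemany(
    "INSERT INTO featureloc (feature_id, srcfeature_id, fmin, fmax) VALUES (?, ?, ?, ?)",
    [(2, 1, 1000, 2000), (3, 1, 1499, 1500), (4, 1, 2999, 3000)],
)


def snps_in_gene(connection, gene_name):
    """Return names of SNP features located within the named gene."""
    rows = connection.execute(
        """
        SELECT snp.uniquename
        FROM feature AS gene
        JOIN cvterm AS gene_type ON gene_type.cvterm_id = gene.type_id
        JOIN featureloc AS gene_loc ON gene_loc.feature_id = gene.feature_id
        JOIN featureloc AS snp_loc
             ON snp_loc.srcfeature_id = gene_loc.srcfeature_id
            AND snp_loc.fmin >= gene_loc.fmin
            AND snp_loc.fmax <= gene_loc.fmax
        JOIN feature AS snp ON snp.feature_id = snp_loc.feature_id
        JOIN cvterm AS snp_type ON snp_type.cvterm_id = snp.type_id
        WHERE gene.uniquename = ?
          AND gene_type.name = 'gene'
          AND snp_type.name = 'SNP'
        ORDER BY snp_loc.fmin
        """,
        (gene_name,),
    ).fetchall()
    return [name for (name,) in rows]


print(snps_in_gene(conn, "gene_A"))
```

Because a new kind of feature is introduced by adding a vocabulary term rather than a new table, the same tables and queries could hold plant genes, markers, or QTL without structural changes.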

Software Engineering Practices
Requirements and design phases used Rational Unified Process guidelines, Unified Modeling Language (UML), and the Enterprise Architect tool to produce artifacts such as use-case diagrams, sequence or activity diagrams, and class diagrams. Microsoft PowerPoint was used to prototype user interfaces. The software code base and architectural design adhered to object-oriented programming practices, design patterns, and testing or coding standards to ensure extensibility, reusability, stability, and maintainability.

The software development team developed over 200 unit tests for the Java, Perl, and Flash code bases (JUnit, PerlUnit, and AsUnit, respectively) and automated unit testing and code base builds using Apache Ant (http://ant.apache.org/). Configuration control was managed with tagged and versioned releases using the Concurrent Versions System (CVS). In-line code was documented with the documentation tools for Java, Perl, and ActionScript (e.g., JavaDoc). A "circle-back" phase was performed after each release to move the code from working to hardened, production-worthy quality and to update the UML artifacts.

We wish to express our gratitude to the many members of the immune response population genetics project team and to an anonymous reviewer who provided a number of helpful suggestions.

Received for publication August 7, 2006.


    REFERENCES
 




