National Center for Genome Resources, 2935 Rodeo Park Dr. East, Santa Fe, NM 87505. Funded, in part, by NIH-NIAID HHS200400064C, NSF BDI-0516487, and USDA-ARS SCA 58-3625-2-109
* Corresponding author (wdbeavis{at}agron.iastate.edu).
| ABSTRACT |
|---|
Abbreviations: CGL, candidate genetic loci; CMTV, Comparative Map and Trait Viewer; GEYSIR, Genome Exploration and Survey of Immune Response; KEGG, Kyoto Encyclopedia of Genes and Genomes; LD, linkage disequilibrium; LIS, Legume Information System; NCBI, National Center for Biotechnology Information; NCGR, National Center for Genome Resources; NIAID, National Institute of Allergy and Infectious Disease; NIH, National Institutes of Health; OMIM, Online Mendelian Inheritance in Man; PI, principal investigator; QTL, quantitative trait loci; SNP, single nucleotide polymorphism; SSWAP, Simple Semantic Web Architecture and Protocol; TEAM, A Tool for the Integration of Expression and Linkage in Association Maps; XML, extensible markup language.
| ACKNOWLEDGMENTS |
|---|
Received for publication August 7, 2006.
Genomics and bioinformatics are expected to revolutionize crop improvement. We have to admit, however, that while information from the various "omics" is being used by developmental and evolutionary biologists, the information is not being used routinely by translational plant biologists or applied plant breeders. This is due to a failure to provide information from omics technologies in formats that can be used to develop more effective and efficient assays that the breeder can use for selection. A similar situation exists in biomedical research where more efficacious therapies will be realized if omics information is translated into biomarker-based diagnostics. Because data are still lacking in plant science, herein we describe the development of an integrated web-based system that supports translational research using a biomedical example that can serve as a model for translational plant research. We also discuss how the bioinformatic system is agnostic with respect to data content and is capable of accepting omics data that eventually will be generated by plant biologists for use by plant breeders to develop diagnostic biomarkers for use in selection.
| INTRODUCTION |
|---|
Until recently translational biomedical research was similarly lacking, that is, omics data and bioinformatics were not producing efficient, effective diagnostics and subsequent therapies. Thus, medical clinicians have not used emerging genomics-based knowledge in bedside practice. Beginning in 2003, the National Institute of Allergy and Infectious Disease (NIAID), one of the National Institutes of Health (NIH), began to address the gap between basic and applied medical research by funding efforts to translate structural and functional genomic information into useful diagnostics. Herein we present a translational biomedical research project that is focused on development of DNA-based biomarker diagnostics, describe how translational bioinformatics can provide the necessary tools to identify these diagnostic biomarkers, and discuss how such an approach can be used for translational plant science.
When the human genome sequencing effort began, there was considerable debate within the biomedical research community about whether this fundamental information would enable the NIH to meet its mission of developing useful diagnostics and therapeutics for human health (Baxevanis, 2001). While the completion of a canonical human genome sequence has not directly created diagnostic or therapeutic applications, it is recognized that the complete genome sequence, coupled with automated annotation based on powerful gene modeling algorithms, has greatly enhanced the ability to locate candidate genetic loci (CGL). Subsequent map-based cloning and resequencing of alleles at CGL from affected and unaffected individuals (i.e., association genetics) has revealed the allelic basis of over 1200 simple Mendelian diseases (Botstein and Risch, 2003). Further bioinformatic analyses have revealed that over 80% of these diseases result from deletions or insertions and missense mutations in exons. These mutations in the translated DNA sequence have enabled development of DNA-based diagnostic tests for these simply inherited diseases, even though therapies for most still await better understanding of the molecular mechanisms that link genotypes to phenotypes.
As with the human genome, the full genome sequence from Arabidopsis and "gene-space" sequences from reference plant species such as rice (Oryza sativa L.), medicago (Medicago truncatula Gaertner), poplar (Populus trichocarpa L.), maize (Zea mays L.), and soybean [Glycine max (L.) Merr.] have been combined with gene modeling algorithms to annotate the sequences. Functional annotation is proceeding with both computational methods (Kriventseva et al., 2005) and through experimental approaches (http://www.nsf.gov/pubs/2005/nsf05624/nsf05624.htm; verified 15 Nov. 2007). The advantage of functional genomic studies in Arabidopsis and reference plant species relative to humans is that mutations can be induced and evaluated for major Mendelian phenotypes. The disadvantage is that these induced mutants cannot be used directly for diagnostic tests in crop species, unless the crop species is also a reference species (e.g., rice). Nonetheless, bioinformaticists have leveraged this information to develop comparative genomics analyses based on homology among plant species. These methods have been incorporated into automated structural and functional annotations for genes in crop species (e.g., see the TIGR Gene Indices, http://www.tigr.org/tdb/tgi/plant.shtml; verified 15 Nov. 2007). Identification and structural annotation of sequences based on gene models is still not sufficient, however, for use in selection of complex and quantitative traits because the breeder needs the sequence for the allelic variant associated with the desirable trait to develop a high throughput, heritable assay.
The successful identification of allelic variants responsible for simple Mendelian traits prompted biomedical researchers to evaluate the use of association genetics to identify the allelic basis of variability in complex and quantitative traits. Ideally, they would like to do this on a whole genome basis because of the inherent bias in the candidate gene approach. Unfortunately, to have reasonable power to find the allelic associations with diseases of moderate relative risks could require assays of millions of common single nucleotide polymorphisms (SNPs) on thousands of individuals (Botstein and Risch, 2003). Even the considerable budgets of NIH cannot support such studies with current technologies. As an alternative, biomedical researchers have pursued association genetic studies of complex diseases using prior knowledge about CGL. The candidate gene approach has been successful in identifying allelic variants associated with a large number of complex traits including asthma, Alzheimer disease, breast cancer, stroke, and hyperlipidemia (e.g., see Chasman et al., 2004; Drysdale et al., 2000; Gretarsdottir et al., 2003; Judson et al., 2004; Lohrisch and Piccart, 2000, 2001a, 2001b; Poirier et al., 1995; Rogaeva et al., 2007; Steinthorsdottir et al., 2007; Winkelmann et al., 2003).
The general candidate gene approach consists of (i) identification of CGL, followed by (ii) an association genetics study, and (iii) validation of any statistically significant associations. For purposes of this manuscript we will focus on translational bioinformatics to facilitate identification of CGL. To identify CGL the translational researcher needs to integrate information derived from different types of data from unrelated experiments. At the very least this could require literature studies in genetics, molecular biology, biochemistry, and physiology. With the emergence of numerous high throughput biotechnologies (omics data), the researcher is also faced with the need to aggregate and integrate data from multiple databases and websites. For plant species additional data based on syntenic relationships among genetic maps may be necessary to leverage sequence information from reference species to the crop species of interest (Gonzales et al., 2007). From a translational research perspective, the data and information must be presented and communicated in an understandable format for the translational researcher interested in subsequent development of DNA-based assays. Given the familiarity and ubiquitous use of web-based browsers by translational researchers, a further requirement is for user interfaces to be implemented in a system easily accessed with standard web-based browsers.
Since the necessary omics data resources in crop species are still being developed we will use a biomedical project to illustrate these points by describing a software system that we recently developed for biomedical researchers interested in identification of CGL for development of DNA-based diagnostics in human infectious diseases.
Background of a Translational Research Project
During the last 10 to 15 yr, a great deal has been learned about the human immune system. From the literature, we are aware of about 1000 genes that are likely involved in immune and allergic responses and we know many of the genetic networks and pathways through which these genes operate (Hunter and Reiner, 2000; Harty et al., 2000; Takeda et al., 2003). For this particular project we have been studying the relationship between the innate and adaptive immune systems. Specifically, we are interested in polymorphisms in the Toll receptor genes of dendritic cells and their impact on signal transduction pathways affecting T cells and their subsequent Th1 and Th2 responses to infection (see Nishimura [2001] for these pathways). Effective vaccines depend on proper functioning of this system to produce a Th1 cell response. Th1 cells are "programmed" to recognize protein epitopes from the vaccine. Once programmed, Th1 cells produce pro-inflammatory cytokines that stimulate destruction of microbial pathogens. In some individuals, a vaccine will elicit a Th2 cell response causing production of a class of cytokines that cause hyperallergenic and inflammatory reactions that can lead to severe disease and death. Whether or not genetic variants are associated with these different cellular responses and the corresponding clinical syndromes is currently unknown.
In 2004 the NIAID requested proposals to develop diagnostic biomarkers that are predictive of unfavorable immunological responses to infections and vaccines. Twenty-five years ago, smallpox vaccines were given routinely and as many as one in a million vaccinations resulted in death and at least 1 in 50 thousand vaccinations resulted in adverse reactions (Henderson, 1996). Since 2001, there has been serious debate about whether society will accept such high rates of negative side effects should the smallpox vaccine have to be readministered (Lane and Goldstein, 2003). Additionally, vaccines for influenza can cause severe allergic reactions that occasionally result in death. To address this challenge the National Center for Genome Resources (NCGR) partnered with deCODE Genetics (Reykjavik, Iceland) and the University of New Mexico Health Sciences Center (Albuquerque) to develop DNA-based biomarkers that are predictive of unfavorable responses to smallpox and influenza vaccines. An applied clinical goal of the project is a rapid biomarker assay that can be used to advise people seeking vaccination about their relative risk for an unfavorable reaction.
Although the specific genes and phenotypes of this project are not relevant to plant breeders, the higher level applied goal is. In order for plant breeders to take advantage of omics information they will need "field kits" consisting of DNA-based biomarkers that will enable them to rapidly predict the response of varieties to treatments, whether those are herbicide treatments, pathogen attacks, or drought. As with the human immune response system, almost all plant responses are complex and polygenic; thus, the candidate gene approach to identifying DNA-based biomarkers associated with complex traits will be the same.
| Methods and Materials |
|---|
Software Development and Project Management
The Agile project management process (Schwaber, 2004) was used for biweekly software releases. This included iteration planning, sometimes referred to as sprint planning, involving scientists and software developers and daily 15-min status meetings, or Scrums, by the software development team. The technical aspects of software system architecture and software practices are described in the Appendix.
Data and Information Sources
During the iterative planning process the project team identified multiple sources of data and information involving genetic factors related to human immune responses. An immune response candidate gene list, based on immune response literature, was curated and reviewed by the project principal investigators (PIs). Genetic maps and microsatellite marker sets were contributed by collaborators at deCODE Genetics (Kong et al., 2002). Human chromosome sequence maps, genes, and SNPs were retrieved from the National Center for Biotechnology Information's (NCBI's) RefSeq project Build 35 v.1: chromosome sequence maps from the Nucleotide db (http://www.ncbi.nlm.nih.gov/sites/entrez?db=nuccore/ verified 22 Oct. 2007), genes from EntrezGene (http://www.ncbi.nlm.nih.gov/sites/entrez?db=gene/ verified 22 Oct. 2007), and sequence variants from dbSNP (http://www.ncbi.nlm.nih.gov/sites/entrez?db=snp/ verified 22 Oct. 2007). Supporting functional information resources, such as literature citations, are hyperlinked to PubMed and Online Mendelian Inheritance in Man (OMIM), as well as to pathway information from the Kyoto Encyclopedia of Genes and Genomes (KEGG) (Kanehisa et al., 2006) and BioCarta (http://www.biocarta.com/genes/allpathways.asp/ verified 22 Oct. 2007).
In addition to pre-existing data and information the project is generating its own experimental data. For example we will generate likelihood statistics on a whole genome basis from LD studies, on a haplotype basis from association genetics studies, and predicted phenotypes from dendritic cell–based expression arrays from independent samples in validation experiments.
We used the extensible markup language (XML), a technology-neutral data interchange format, for data transfer and management. XML was chosen because parsing engines for it are available on virtually every platform. What constitutes valid XML, and the structure the XML must follow, is declared explicitly using document type definitions.
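The XML-based interchange described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the element and attribute names (`likelihood_study`, `marker`, `position_cm`, `lod`) are hypothetical and are not taken from the project's document type definitions, and Python's standard-library parser checks well-formedness only, so the structural check below is hand-written rather than true DTD validation.

```python
import xml.etree.ElementTree as ET

# Hypothetical interchange record; element and attribute names are illustrative only.
SAMPLE = """<likelihood_study name="study_A">
  <marker id="M1" chromosome="1" position_cm="12.5" lod="1.8"/>
  <marker id="M2" chromosome="1" position_cm="47.0" lod="3.6"/>
</likelihood_study>"""

# Attributes every marker element must carry (an assumption for this sketch).
REQUIRED = ("id", "chromosome", "position_cm", "lod")


def parse_study(xml_text):
    """Parse a study record and return (study_name, list of marker dicts)."""
    root = ET.fromstring(xml_text)  # raises ParseError if the text is not well formed
    if root.tag != "likelihood_study":
        raise ValueError("unexpected root element: " + root.tag)
    markers = []
    for elem in root.findall("marker"):
        missing = [a for a in REQUIRED if a not in elem.attrib]
        if missing:
            raise ValueError("marker missing attributes: " + ", ".join(missing))
        markers.append({
            "id": elem.get("id"),
            "chromosome": elem.get("chromosome"),
            "position_cm": float(elem.get("position_cm")),
            "lod": float(elem.get("lod")),
        })
    return root.get("name"), markers


name, markers = parse_study(SAMPLE)
print(name, len(markers))
```

Because the record is plain XML, the same file could be produced or consumed by any tier of a system regardless of the language that tier is written in, which is the property the text attributes to the format.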
| Results |
|---|
Use-Case Illustration
Upon entry to GEYSIR it is possible to observe any and all likelihood plots across the whole genome and these can be compared among multiple studies (Fig. 1a). The ability to compare results from multiple studies and across the entire genome in a single view is enabled by converting the likelihood surface into a "heat map" in which the researcher has the ability to assign colors to the likelihood values generated by the statistical analyses. This capability was originally developed for translational researchers working with plant breeders at CGIAR centers through the comparative map and trait viewer system (Sawkins et al., 2004).
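The conversion of likelihood values into heat-map colors can be sketched as follows. This is a minimal illustration, not the GEYSIR implementation: the function name, the linear interpolation between two user-assigned colors, and the sample values are all assumptions made for the example.

```python
def likelihood_to_color(value, low, high,
                        low_rgb=(255, 255, 255), high_rgb=(255, 0, 0)):
    """Linearly interpolate an RGB hex color for a likelihood value.

    Values outside [low, high] are clamped to the nearest bound.
    The two endpoint colors stand in for the colors a user would assign.
    """
    if high <= low:
        raise ValueError("high must exceed low")
    t = (min(max(value, low), high) - low) / (high - low)
    rgb = tuple(round(a + t * (b - a)) for a, b in zip(low_rgb, high_rgb))
    return "#{:02x}{:02x}{:02x}".format(*rgb)


# Hypothetical likelihood values at successive map positions for two studies.
studies = {
    "study_A": [0.2, 1.5, 3.0, 0.8],
    "study_B": [0.0, 0.9, 2.4, 3.0],
}

# One row of colors per study, so that studies can be stacked and compared by eye.
heat_rows = {name: [likelihood_to_color(v, 0.0, 3.0) for v in values]
             for name, values in studies.items()}

for name, row in heat_rows.items():
    print(name, row)
```

Rendering each study as one row of colored cells is what lets several whole-genome scans share a single view; a peak that recurs across studies shows up as a vertical band of the "high" color.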
through the JAK/STAT cell signaling pathway (Fig. 2) (Trinchieri et al., 2003) and note that its receptor, IL12Rβ2, is one of the candidate genes under the likelihood peak. With a "control click" on IL12Rβ2, or any genes associated with immune responses, it is possible to view the supporting information from the literature and observe the known pathways in which the gene participates (Fig. 1c).
A variant on this use-case is to enter the system with a gene of interest and a desire to know if it is involved in immunologic response networks and whether it is implicated in project LD studies and to identify the various alleles for use in association studies. Once signed onto GEYSIR it is immediately possible to query the system for all curated genes involved with immune responses (Fig. 3). Selecting the gene of interest, for example, IL12Rβ2, generates a view of the location for the gene in the genome (Fig. 1b) with its associated genetic and physical maps. Subsequent steps of viewing the gene in the context of its pathways and identification of SNPs in the context of its structural annotation and sequence are the same as presented in Fig. 1c and 1d.
| Discussion |
|---|
From the biologist's perspective, one of the more important activities of bioinformatics is to aggregate and integrate disparate data and information resources into a single system. There are numerous approaches to integration (Siepel et al., 2001a, 2001b) and many have been used to develop plant information databases. As the model species for understanding plant developmental biology, Arabidopsis thaliana has been used to generate enormous amounts of omics data through the 2010 project (http://www.nsf.gov/pubs/2005/nsf05624/nsf05624.htm). The Arabidopsis Information Resource (TAIR, http://www.Arabidopsis.org/ verified 22 Oct. 2007) is the primary and most comprehensive plant information system with genetic, genomic, transcript (expressed sequence tag and microarray), pathway, and metabolomic data. MaizeGDB (http://www.maizegdb.org/ verified 22 Oct. 2007) provides a similar, although less comprehensive, resource for developmental biologists and geneticists working on maize. On the other hand, plant evolutionary biologists are served by systems that integrate a single data type, across multiple species. For example, PlantGDB (http://www.plantgdb.org/ verified 22 Oct. 2007) integrates DNA sequence data for purposes of supporting comparative genomics. And, a few plant information systems integrate information across both species and data-types in what are known as clade-oriented databases (Stein et al., 2006) such as Gramene (http://www.gramene.org/ verified 22 Oct. 2007) and the Legume Information System (LIS; http://www.comparative-legumes.org/ verified 22 Oct. 2007).
While these integrated systems are providing significant support for developmental and evolutionary plant biologists, they have not been designed for or used by translational plant researchers. Thus, omics information and knowledge are not being used by applied plant breeders. There are several reasons for this lack of translation including:
Indeed, most of the data and information used by the biomedical project that motivated GEYSIR will come from publicly available websites; <15% will be generated by the project. Because it is difficult for the biomedical researcher to collect, manage, analyze, and interpret all of these data, especially when they are distributed among websites and project databases, the NIH has recognized the need for translational bioinformatics. Like most biomedical researchers, translational plant researchers will not have the skills and time to translate existing information into effective and practical knowledge if the information is not aggregated, integrated, and provided through intuitive interfaces.
We are aware of two previous projects that have addressed the issues of aggregating, integrating, and presenting omics data in a format useful for translational research: Comparative Map and Trait Viewer (CMTV) (Sawkins et al., 2004) and A Tool for the Integration of Expression and Linkage in Association Maps (TEAM) (Franke et al., 2004). While CMTV is extremely flexible and very powerful at integrating disparate sources of information, it is a system that was developed for purposes of exploration and discovery. Thus, it did not meet the requirement of presenting information in a single view with commonly used web browsers. TEAM was designed to address translational research goals and met many of the requirements of our biomedical research project. However, it did not meet the necessary requirement of presenting information in a single view using a dynamic interactive user-interface easily accessible by commonly used web browsers.
It should be emphasized that our current implementation of GEYSIR does not aggregate all data into a single database system. It relies on linkages to several sources of information, for example, OMIM, KEGG, and BioCarta and it is well known that publicly funded web resources can be volatile. Because of our close association with BioMOBY services (Wilkinson and Links, 2002), Semantic MOBY (http://biomoby.open-bio.org/index.php/semantic-moby/ verified 22 Oct. 2007) and the recently developed Simple Semantic Web Architecture and Protocol (SSWAP) (Gessler, 2008), a fusion of semantic BioMOBY and BioMOBY services, we think GEYSIR will be more effectively deployed using SSWAP as soon as these technologies become more mature.
The GEYSIR system illustrates that it is possible to rapidly design, develop, and implement a system to help the biomedical researcher mine multiple information resources through a single dynamic web-based interface. Equally compelling, the GEYSIR system became functional while it was still being developed. Indeed, it continues to be developed for additional use-cases while it is being used by project PIs.
It should be re-emphasized that the described use case and its variants will provide the translational researcher with information to identify potential allelic variants for use in association genetics tests. However, this is not sufficient for the applied practitioner interested in using DNA-based diagnostic markers. To meet this ultimate goal, the results of association genetics and validation tests will need to be added to and integrated within the GEYSIR system. For example, we have defined a next use-case: A researcher would like to enter the system through knowledge of signal transduction or cell signaling pathways, observe the results of a microarray experiment superimposed on the pathway, observe the genomic locations of the genes involved in the pathway, and identify the allelic variants of genes involved in the pathway. To address this use-case we will incorporate information from pathways and data from gene expression assays into the system. This will allow the Flash interface to dynamically display results from microarrays on linkage maps and known pathways. Such functional viewers will be integrated with the existing viewers to accommodate selection of allelic variants at CGL for the design of primers (tagSNPs) to be used in the subsequent association genetic and validation studies.
As noted in the Materials and Methods, we imported multiple sources of data into GEYSIR (Fig. 4). However, due to its generic n-tiered architecture, GEYSIR is agnostic with respect to the specific data and information resources that are displayed through the Flash interface. Thus, the system will operate with plant genomic and QTL information. Therefore, it is possible to deploy the GEYSIR system for use by translational plant researchers. Because the full complement of omics information available to biomedical researchers is not yet available for any crop species, additional information about syntenic relationships between the crop species and a reference species will need to be integrated into the system before it can be used effectively by plant biologists. Gonzales et al. (2007) illustrated how such information from M. truncatula can be leveraged for identification of a CGL in soybean. Important information resources needed for a plant version of GEYSIR (Fig. 5) include:
| Conclusions |
|---|
| APPENDIX |
|---|
Usually web-based tools for exploring genomic data are statically rendered HTML pages, which lack live interactivity and are often cumbersome for scientific discovery. To address these issues, we developed an interactive and responsive Flash application enabling the exploration of a wide scale of genomic data in a single genetics-based view that can include data spanning all chromosomes, results of LD and association tests, marker sets, gene neighborhoods, and SNPs. The Flash client is built using ActionScript 2.0 and targeted for the Flash 7 or greater plug-in.
The web server tier consists of Apache HTTP Server v. 2.0.54 (http://httpd.apache.org/ verified 22 Oct. 2007) and Apache Tomcat Servlet engine v. 5.5.9 (http://tomcat.apache.org/ verified 22 Oct. 2007). The web server application is written in the Java language, including Java 2 Platform Enterprise Edition (J2EE) technologies such as Servlets, Java Database Connectivity (JDBC), and Java Server Pages (JSP). The application was designed using the Struts web application framework (http://struts.apache.org/ verified 22 Oct. 2007).
The database server is a Sybase Adaptive Server Enterprise (http://www.sybase.com/products/informationmanagement/adaptiveserverenterprise/ verified 22 Oct. 2007). The database schema is based on the GMOD database schema known as Chado (http://www.gmod.org/ verified 22 Oct. 2007). The Chado schema enables flexibility through its inherent reliance on controlled vocabularies and ontologies to define data types in the database. We implemented Chado with the Sequence Ontology Feature Annotation (SOFA) subset of the Sequence Ontology (Eilbeck et al., 2005) to define the features that can be directly located on a biological sequence, including genes and SNPs.
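The idea of typing features through a controlled vocabulary can be sketched with a drastically reduced, Chado-like schema. The sketch below is an assumption-laden illustration: it uses SQLite in place of Sybase, keeps only a handful of columns from three tables (`cvterm`, `feature`, `featureloc`), and the feature names and coordinates are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Controlled-vocabulary terms; each feature's type is a row here, not a column or table.
CREATE TABLE cvterm (cvterm_id INTEGER PRIMARY KEY, name TEXT UNIQUE NOT NULL);
CREATE TABLE feature (
    feature_id INTEGER PRIMARY KEY,
    uniquename TEXT NOT NULL,
    type_id    INTEGER NOT NULL REFERENCES cvterm(cvterm_id)
);
-- Location of one feature on another (source) feature, e.g. a gene on a chromosome.
CREATE TABLE featureloc (
    featureloc_id INTEGER PRIMARY KEY,
    feature_id    INTEGER NOT NULL REFERENCES feature(feature_id),
    srcfeature_id INTEGER NOT NULL REFERENCES feature(feature_id),
    fmin INTEGER NOT NULL,
    fmax INTEGER NOT NULL
);
""")

# Sequence Ontology term names used as feature types.
conn.executemany("INSERT INTO cvterm (name) VALUES (?)",
                 [("chromosome",), ("gene",), ("SNP",)])


def add_feature(uniquename, type_name, src=None, fmin=None, fmax=None):
    """Insert a feature of the given vocabulary type, optionally located on src."""
    type_id = conn.execute("SELECT cvterm_id FROM cvterm WHERE name = ?",
                           (type_name,)).fetchone()[0]
    fid = conn.execute("INSERT INTO feature (uniquename, type_id) VALUES (?, ?)",
                       (uniquename, type_id)).lastrowid
    if src is not None:
        conn.execute("INSERT INTO featureloc (feature_id, srcfeature_id, fmin, fmax) "
                     "VALUES (?, ?, ?, ?)", (fid, src, fmin, fmax))
    return fid


# Hypothetical features and coordinates, for illustration only.
chrom = add_feature("chr_example", "chromosome")
add_feature("gene_example", "gene", chrom, 1000, 5000)
add_feature("snp_1", "SNP", chrom, 1200, 1201)
add_feature("snp_2", "SNP", chrom, 7000, 7001)


def snps_in_gene(gene_name):
    """Return SNP names whose location falls inside the named gene's location."""
    rows = conn.execute("""
        SELECT s.uniquename
        FROM feature s
        JOIN cvterm t      ON t.cvterm_id = s.type_id
        JOIN featureloc sl ON sl.feature_id = s.feature_id
        JOIN featureloc gl ON gl.srcfeature_id = sl.srcfeature_id
        JOIN feature g     ON g.feature_id = gl.feature_id
        WHERE t.name = 'SNP' AND g.uniquename = ?
          AND sl.fmin >= gl.fmin AND sl.fmax <= gl.fmax
        ORDER BY sl.fmin
    """, (gene_name,)).fetchall()
    return [r[0] for r in rows]


print(snps_in_gene("gene_example"))
```

In this layout, supporting a new kind of feature (for instance a QTL interval in a plant deployment) would mean adding a vocabulary term rather than altering tables, which is one way to read the flexibility the text attributes to the schema.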
Software Engineering Practices
Requirements and design phases utilized Rational Unified Process guidelines, Unified Modeling Language (UML), and the Enterprise Architect tool to produce artifacts such as use-case diagrams, sequence or activity diagrams, and class diagrams. Microsoft PowerPoint was used to prototype user interfaces. The software code base and architectural design adhered to object-oriented programming practices, design patterns, and testing or coding standards to ensure extensibility, reusability, stability, and maintainability.
The software development team developed over 200 unit tests for the Java, Perl, and Flash code bases (JUnit, PerlUnit, and AsUnit, respectively) and automated unit testing and code-base builds using Apache Ant (http://ant.apache.org/). Configuration control was managed with tagged and versioned releases using the Concurrent Versions System (CVS). In-line code was documented with the documentation tools for Java, Perl, and ActionScript (e.g., JavaDoc). A "circle-back" phase was performed after each release to move the code from working quality to hardened, production-worthy quality, and UML artifacts were updated.
We wish to express our gratitude to the many members of the immune response population genetics project team and to an anonymous reviewer who provided a number of helpful suggestions.