Crop Science Grow Your Career with CSSA
HOME HELP FEEDBACK SUBSCRIPTIONS ARCHIVE SEARCH TABLE OF CONTENTS
 QUICK SEARCH:   [advanced]


     


Published online 23 February 2005
Published in Crop Sci 45:748-757 (2005)
© 2005 Crop Science Society of America
677 S. Segoe Rd., Madison, WI 53711 USA
This Article
Right arrow Abstract Freely available
Right arrow Figures Only
Right arrow Full Text (PDF) Free
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Alert me to new issues of the journal
Right arrow Download to citation manager
Right arrow reprints & permissions
Citing Articles
Right arrow Citing Articles via ISI Web of Science (2)
Right arrow Citing Articles via Google Scholar
Google Scholar
Right arrow Articles by Burgueño, J.
Right arrow Articles by Autran, D.
Right arrow Search for Related Content
PubMed
Right arrow Articles by Burgueño, J.
Right arrow Articles by Autran, D.
Agricola
Right arrow Articles by Burgueño, J.
Right arrow Articles by Autran, D.
Related Collections
Right arrow Biometrics

Spatial Analysis of cDNA Microarray Experiments

Juan Burgueñoa, Jose Crossaa,*, Daniel Grimanellib, Olivier Leblancb and Daphne Autranb

a Biometrics and Statistics Unit, International Maize and Wheat Improvement Center (CIMMYT), Apdo. Postal 6-641, 06600 México DF, México
b Institut de Recherche pour le Développement (IRD), Apdo. Postal 57297, 06501 México DF, México

* Corresponding author (j.crossa{at}cgiar.org)


    ABSTRACT
 TOP
 ABSTRACT
 INTRODUCTION
 MATERIALS AND METHODS
 RESULTS
 DISCUSSION
 REFERENCES
 
Microarray experiments allow RNA level measurements for many genes in multiple samples. However, mining the biological information from the large sets of data generated by microarrays requires the use of appropriate statistical methods to adjust the observed values for experimentally introduced variability (normalization process) before testing differences among samples. Normalization of microarray experiments is a critical step for reducing false positives and false negatives. This paper explores the normalization of cDNA microarray experiments by a method that uses the blank spot intensity values to make spatial adjustment (SA) of both foreground and background DNA spot intensity values, by fitting an autoregressive mixed linear model through the residual maximum likelihood (REML) methodology in the direction of the rows and the columns of the microarray. Application of this spatial normalization to three cDNA array experiments serves as a case study to validate the SA. Results show that the spatial analysis allows selection of candidate genes with lesser numbers of false positive and false negative genes.

Abbreviations: AR, separable autoregressive correlation • GA, global adjustment • LA, lowess adjustment • REML, residual maximum likelihood • SA, spatial adjustment


    INTRODUCTION
 TOP
 ABSTRACT
 INTRODUCTION
 MATERIALS AND METHODS
 RESULTS
 DISCUSSION
 REFERENCES
 
MICROARRAYS ARE FAST BECOMING an essential tool in modern biology and the life sciences. Because of the complicated steps involved in generating microarray data and intrinsic experimental variability, microarray experiments are usually very noisy and have no fixed scale. Thus, statistical methods for appropriate experimental designs and data standardization and normalization are essential for identifying spots with true differential expression when genes are applied to the array.

In the process of printing microarrays, of obtaining, labeling, and hybridizing RNA, and of reading and quantifying the results, several experimental factors can cause systematic spatial variability within and among arrays and slides, thus distorting the estimation of gene expression patterns and contrasts. Some of these factors include unequal amounts of DNA deposited at each spot, the intensity-dependent dye effect, reverse transcription efficiency, and settings and calibration of the laser scanner (Schuchhardt et al., 2000). A good experimental design, if taken into account when designing the slides, can reduce the impact of some of these disturbance factors. Unless these factors are appropriately accounted for in the statistical model, they will significantly reduce the precision of estimates of gene expression and gene expression contrasts.

Some normalization methods assume that the two channels of intensities are related by a constant so that the center of the distribution of the log of the ratio is shifted to zero. Chen et al. (1997) proposed an iterative method for estimating that constant. Kerr et al. (2000) and Wolfinger et al. (2001) performed the normalization within the ANOVA. These methods do not consider that the relationship between channels (Cy3 and Cy5) change with the intensity of the signal. The lowess smoothing normalization method (Yang et al., 2002) performs a local linear fit and is effective for removing the dye effect. If the lowess correction is based on all the genes, then it should be assumed that most genes are not differentially expressed.

The dye effect is not the only source of systematic error in microarray experiments. Other spatial effects can cause systematic error and are not considered in some standard normalization methods. Several authors have found spatial variation in microarray experiments and considered different spatial models to normalize gene expression in microarray data (Balázsi et al., 2003; Wilson et al., 2003; Yang et al., 2003; Wernisch et al., 2003; Baird et al., 2004). Wernisch et al. (2003) used a spatial lowess surface after lowess normalization. Wilson et al. (2003) proposed an intensity-based color normalization by fitting a single lowess curve to the transformed data and smoothing the residuals with a median filter to estimate spatial trend. They found that a 3-by-3 block of spots is appropriate, but this may not be the case in other arrays, which introduces the problem of defining the block size. Hoffmann et al. (2002) compared different normalization procedures and statistical analyses for detecting differentially expressed genes and concluded that normalization had a severe influence on the detection of genes with different expressions.

Mixed linear models with a specific variance–covariance structure can be used to control the spatial variability within and among slides of cDNA microarray experiments. In this study, we model the spatial variability of microarray experiments by considering the spatial allocation of the spots within the slide. The separable autoregressive correlation (AR) structure proposed by Cullis and Gleeson (1991) and Gilmour et al. (1997), extensively used in agriculture field experiments, was used for studying spatial variability in the slide of cDNA microarray experiments. Baird et al. (2004) used this SA for normalizing microarray experiments including all the spots on the array. We applied this SA to three different sets of cDNA microarray experiments, but used only the blank spots present on the array. Mixed linear models based on SA adjusted data were used as an a priori procedure before selecting the candidate genes with significant expressions. Also, we used the global adjustment (GA) of intensity following Chen et al. (1997) and the lowess normalization adjustment (LA) of Yang et al. (2002). For Data Set 1, a random sample from the subset of selected genes obtained by the SA and GA methods was validated using a standard semiquantitative RT-PCR protocol to estimate the frequency of false positives. Data Set 3 is a white experiment where the same sample was marked with the two dyes, and results from the SA adjustment were compared with those results from the LA in terms of the number of false positives.


    MATERIALS AND METHODS
 TOP
 ABSTRACT
 INTRODUCTION
 MATERIALS AND METHODS
 RESULTS
 DISCUSSION
 REFERENCES
 
Normalization Model Using the Separable Autoregressive Correlation Structure
For an experiment arranged in rows and columns, Gleeson and Cullis (1987) proposed sequentially fitting a class of autoregressive-integrated-moving average models to the errors in one direction (rows or columns). Cullis and Gleeson (1991) extended the previous model to two directions (rows and columns) assuming that rows and columns in the experiment are regularly spaced. Gilmour et al. (1997) proposed using a separable autoregressive (AR) correlation structure. Thus, they model the experimental error variability as the direct product of an AR correlation structure for columns and an AR correlation structure for rows, denoted by AR1 x AR1. The AR1 x AR1 model is flexible enough to generally represent many different spatial patterns, and often more than one spatial model can be considered to accommodate other sources of variability.

The AR structure proposed by Cullis and Gleeson (1991) and Gilmour et al. (1997) has been used in analyzing agricultural experiments, but it can be applied to study spatial variability in the slide of cDNA microarray experiments by considering the following initial model for background and foreground signals (yij) in each dye:

Formula 1[1]
where µ is the overall mean, b refers to blanks effect, sij is the effect of the spot located in the ith row and the jth column of the slide, and {eta}ijk is the experimental error associated to the ith row, jth column, and the kth replicate of the blank spots. Here, spot is considered as a fixed effect. The AR1 x AR1 spatial model is applied to the residual variation ({eta}ijk) to correct for the spatially correlated variability within a slide. Other effects that account for row and column variation can also be added to the initial AR1 x AR1 adjustment similarly to the model used by Baird et al. (2004); however, we have not included them in this study. Burgueño et al. (2000) found that negligible changes in adjusted values occur when rows and columns effects are added to the SA due to AR1 x AR1.

The AR1 x AR1 spatial model should control variability arising from uneven experimental conditions occurring in the process of printing microarray experiments, and labeling and hybridizing RNA. Thus, this variation is modeled as the direct product of an AR correlation structure for columns and an AR correlation structure for rows. Other variability aligned with rows or columns is usually modeled with random row and column effects. Global effects include any major (nonstationary) trends across the slide. These are fitted as linear trends, cubic smoothing splines, and row and column contrasts and covariates. In this study, we fit only the AR1 x AR1 spatial model and did not add any other term to the model for fitting possible global effects.

In the context of microarray experiments, one measurement for examining the heterogeneity patterns of the slide could be the spatial autocorrelation of neighboring spots within rows or within columns, based on the correlation between residuals at various distances. If there is no spatial pattern, all the correlations will be low. If there is a pattern in the residuals, neighboring residuals will be more similar and so will have a higher correlation.

The idea behind the spatial analysis of the cDNA microarray experiments used in this study is to adjust the gene expression obtained in each spot based on the performance of the blanks. In practical terms, the SA of the foreground and background of a gene is based on the expression of the nearest blank. Thus, the values of the foreground and background of a gene near a spot with a blank with high foreground and background values will be adjusted downward, whereas the value of a gene near a blank with a low expression value will be adjusted upward. Several normalization methods are based on the assumption that most genes have similar expressions. However, this may not be the case in all microarray experiments, and thus other genes such as housekeeping genes (with constant expression) may facilitate controlling other sources of variability (Yang et al., 2002).

Model for Selecting Candidate Genes
As a second stage, a mixed linear model is used to test the significance of gene expression. This sequential approach is similar to that used by Wolfinger et al. (2001), where a normalization model is used first and then a gene model selects the most responsive genes. This is similar to that used by Kerr and Churchill (2001), but using the log of the ratio instead of using the log of the signal, that is:

Formula 2[2]
where ypqk is the log of the ratio between signals collected from two samples between the reference or base line channel (Channel 1, Cy3) and the experimental channel (Channel 2, Cy5); µ is the overall mean; ap and gq are the effects of the pth slide and the qth gene; (ag)pq is the interaction of the qth gene in the pth slide; and {eta}pqk is the experimental error associated with the pth slide, the qth gene, and kth replicate that is assumed to be independent and identically distributed with mean zero and variance {sigma}2e. Equation [2] was fitted using ASReml software (Gilmour et al., 2002). Because there are different dyes in each sample, the effect of the slide is confounded with the dye effect. At this stage, the slide and the slide x gene interaction are considered as random effects, whereas genes are considered as fixed effects. A more detailed step-by-step description of the analyses performed for the selection of candidate genes is given below.

The proposed spatial analysis using the spot intensity value to adjust the foreground and the background by means of AR1 x AR1 and then computing the adjusted signal can also be done exclusively on the foreground, on the signal, or on the log ratio. However, this option was not investigated in this study.

The Mixed Linear Model
The statistical models used in the sequential approach given in Eq. [1] and [2] above are mixed linear models. Formally, the estimation of the variance components in a general linear mixed model uses the REML approach. The equation for the general mixed linear model is:

Formula 2
where Y is the response vector, X is the matrix design for fixed effects ({tau}), and Z is the matrix design for random effects (u). The residual ({eta} = {xi} + e) is composed of: (i) the local trend ({xi}) which, in Eq. [1], is modeled by the two-dimensional autoregressive procedures in the direction of the row and columns, AR1 x AR1, of the slide


Formula

[where {sigma}e2 is the residual variance, {rho} is the correlation coefficient between rows ({rho}r) and columns ({rho}c), p is the number of rows, q is the number of columns, and {otimes} is the direct product], and (ii) the residual e after adjusting for all the other terms in the model. The random terms (u, {xi}, e) are pair-wise independent.

Sequential Approach for Identifying Candidate Genes
As already mentioned, the idea behind the spatial analysis is to adjust the gene expression based on the performance of the blanks. The analytical process uses the normalized data obtained by Eq. [1] and creates the mixed linear model used in Eq. [2]. The steps of the approach are as follows:

(i) Perform the initial AR1 x AR1 adjustment of the observed values of the foreground (Dobs.) and the observed background (Bobs.) value in both dyes by fitting the spatial model given in Eq. [1]. Compute the spatially adjusted foreground (Dadj.) and the spatially adjusted background (Badj.) of each spot. At this stage, the usual techniques for outlier detection can be used.

(ii) Eliminate all spots with a Dadj. below the threshold value given by [Badj. + 2(SE)].

(iii) Calculate the adjusted signal as Sadj. = Dadj. Badj..

(iv) If Sadj. < 0 for both treatments, the spot is normalized to zero signal. In cases where Sadj. < 0 for one treatment but significantly Sadj. > 0 for the other treatment, the information can be treated as a simple case of qualitative (±) variation in gene expression. To check the relationship between ratio and mean log intensity, construct an MA graph (M = difference in log intensities between the two channels; A = average log intensities in the two channels) by plotting A = 0.5 x [log2(Sadj.1 x Sadj.2)] vs. M = log2(Sadj.1/Sadj.2), where Sadj.1 is the adjusted signal measured in one reference channel (Cy3 or Cy5), and Sadj.2 is the adjusted signal measured in the other experimental Channel 2.

(v) Fit Eq. [2] to M = log2(Sadj.1/Sadj.2) and test the hypothesis of log2(Sadj.1/Sadj.2) = 0 vs. log2(Sadj.1/Sadj.2) != 0 for each gene.

Software
The ASReml system (Gilmour et al., 2002) was used for the spatial analysis of the microarray experiments. ASReml uses the REML estimation method to estimate variance components in the context of mixed linear models. It is a useful tool for analyzing experiments in agriculture, plant breeding, and genetics because it allows fitting the spatial variability within experiments in a number of ways.

ASReml fits the model Y = X{tau} + Zu + {eta}, with u and {xi} or e having zero mean and variance–covariance matrices given by Var(u) = G and Var({eta}) = R. The default cases are when R = {sigma}2I and Gf = {sigma}f2I.

Microarray Data Sets
Spatial analysis was conducted on three sets of maize cDNA microarrays data, herein referred to as Data Set 1, Data Set 2, and Data Set 3. The slides were obtained from the University of Arizona Maize Genome Project (complete description of the arrays is available at http://www.zmdb.iastate.edu/zmdb/microarray/arrays.php). Acquisition and quantification of array images was performed in ScanArray 4000 with its accompanying software ScanArray 4000 from Packard BioChip Technologies (Billerica, MA). All images were captured using 60% PMT gain, 70 to 75% laser power, and 10-µm resolution at a 50% scan rate. For each spot, the foreground mean value, background mean value, and signal mean values were calculated with ArrayPro Analyzer software from Media Cibernetics (Silver Spring, MD). Background was measured as a local ring with an offset. For Data Sets 1 and 2, we used RNA samples obtained from contrasting maize reproductive tissues: mature ovules before pollination vs. a bulk of developing kernels sampled 3 to 8 d following pollination.

Hybridizations, acquisition, and quantification of array images for the three data sets were performed at the Microarray Unit of the Universidad Nacional Autonoma de México in Mexico City. The raw and adjusted data will be deposited in public databases and are currently available from the third author.

Data Set 1
The microarrays used to generate Data Set 1 (Slides 46 to 49, Microarray UG1.01.02) consisted of eight print-pins on a single slide, each comprising 46 rows and 46 columns. This represents 2116 x 8 = 16928 spot positions at a space of 185 µm, with a 1000-µm gap between print-pins. Seven thousand, four-hundred and four cDNAs were printed in duplicate, 157 controls and house keepings were located in four replicates, and the remaining 1492 spot positions were blanks. Slides 46 to 49 resulted from four independent hybridizations (herein referred to as Slides 46 to 49) consisting of two repetitions with dye inversions, which produced eight estimates (2 reps x 2 dye inversions x duplicates) for each gene.

We attempted to validate the results of the SA done on Data Set 1 using a subset of the selected genes by means of a standard semiquantitative RT-PCR protocol. This allowed us to estimate the frequency of Type I errors (false positives) obtained on data with AR1 x AR1 SA and on data with GA corrections. Actin was used as a control for all RT-PCR experiments. In all cases, PCRs were performed using 17, 20, 25, 30, and 35 cycles to identify cycling conditions for which amplifications were compared during the log phase. The PCR products were resolved on agarose gels, stained with ethydium bromide. Digital images were captured and used to quantify amplification products using a Kodak EDAS 290 gel documentation system. We discarded as false positives all reactions for which the qualitative difference observed with the microarray could not be repeated with RT-PCR, regardless of the actual quantitative values measured by both methods.

Data Set 2
The microarrays used for Data Set 2 (Slides 75 and 77, Microarray UG1.01.01) had a similar design to Data Set 1, but the eight print-pins comprised 48 rows and 45 columns, and the cDNAs were printed in duplicate. Data Set 2 contained 17280 spots, which included 5373 cDNAs and controls, plus 678 blanks. The raw data sets were obtained by hybridizing the microarrays with RNAs obtained from two contrasting maize reproductive tissues.

Data Set 3
The microarrays used to generate Data Set 3 (Slides 34 and 37, Microarray UG1.02.02) consisted of eight print-pins on a single slide, each comprising 46 rows and 46 columns. This represents 2116 x 8 = 16928 spot positions at a space of 185 µm, with a 1000-µm gap between print-pins. Seven thousand, four-hundred and four cDNAs were printed in duplicate, 157 controls and housekeeping genes were located in four replicates, and the remaining 1492 spot positions were blank. This is a white experiment because the same sample was marked with both dyes (Cy3 and Cy5) and hybridized in the two slides (34 and 37). For these experiments we expected no significant change in gene expression except those due to random chance (and to be determined by the selected P value).


    RESULTS
 TOP
 ABSTRACT
 INTRODUCTION
 MATERIALS AND METHODS
 RESULTS
 DISCUSSION
 REFERENCES
 
Spatial, Lowess, and Global Adjustment of Data Sets 1, 2, and 3
For Data Set 1, Fig. 1 shows the spatial location of spots with blanks within Slides 46 to 49. There are 1492 blanks on the slides, representing 8.46% of the total number of spots. The blanks are spread out across the slides and along a column at the end of each print-pin. This spatial location of blanks is not optimal for an efficient SA. A more efficient allocation of blanks would be to spread them out on the slides in a triangular or diamond grid (Maas et al., 2002).


Figure 1
View larger version (33K):
[in this window]
[in a new window]

 
Fig. 1. Spatial location of blanks in Slides 46 to 49 from Data Set 1.

 
It would be effective to have at least 10% of the spots be blanks and avoid having the replicates of the genes placed together. Although these are statistically desirable conditions, they might be technically complex to implement on slides because of the added cost of the printing process. Such designs, however, would significantly improve statistical analysis of microarray data.

In the analysis the effect of the dye and of the biological replicate, and the dye x biological replicate interaction were not included in the model (Eq. [2] and Step v) because they cannot be estimated separately. Furthermore, they were included in the slide x gene interaction.

For Slide 48 of Data Set 1, the spatial trend of the blanks for the foreground and the background at two different conditions is shown in Fig. 2 . It is clear that spatial variability had a greater effect on foreground than the background readings. The response is much more patchy across the slide in Fig. 2c and 2d than in Fig. 2a and 2b. It is worth mentioning that in these figures, general variability patterns can be identified and thus could be included in Eq. [1].


Figure 2
Figure 2
Figure 2
Figure 2
View larger version (206K):
[in this window]
[in a new window]

 
Fig. 2. Interpolation of background (a, b) and foreground (c, d) of the blanks for Slide 48 in two conditions from Data Set 2 [Condition A in (a, c) and Condition B in (b, d)].

 
To analyze the effect of SA, MA graphs were constructed (Fig. 3) . Values in the y axis are M = log2(S1/S2) and in the x axis, A = 0.5 x [log2(S1 x S2)] for the spatially adjusted and unadjusted values. For each of the four slides, the SA eliminates the relationship ratio intensity. Furthermore, the regression analysis (Table 1) indicates that, except in Slide 47, the coefficient of determination (R2) and the coefficient of regression decrease when comparing unadjusted data with SA data. However, these results are not consistent for all slides when the asymmetry of the log of the ratio is considered. In view of the skewness of the log of the ratio, SA was able to decrease the skewness level in Slides 48 and 49 from 1.03 and 0.77 to 0.30 and 0.58, respectively, whereas in Slides 46 and 47, the skewness increased from 0.34 and 0.27 to 0.96 and 0.33, respectively. For Data Set 1, the relationship ratio intensity was not important and present only at low intensities; therefore, the SA corrected a great deal but not all of this effect.


Figure 3
View larger version (28K):
[in this window]
[in a new window]

 
Fig. 3. MA graph of unadjusted values (first four graphs) and spatial adjusted values (last four graphs) corresponding to Slides 46 to 49 from Data Set 1. Values in the y axis are M = log2(S1/S2) and in the x axis, A = 0.5 x [log2(S1 x S2)].

 

View this table:
[in this window]
[in a new window]

 
Table 1. Coefficient of determination (R2), regression coefficient, and their significance and number of values (n) in each slide (46 to 49) from Data Set 1 for unadjusted and spatially adjusted values (SA).

 
For Data Set 2, similar results of the SA are found, but the spatial locations of the blanks (shown in Fig. 4) for Slides 75 and 77 are much more uneven than in Data Set 1. Here, we have only 675 blanks (3.91% of the total number of spots). For this Data Set, spatial variability seemed to affect the foreground and the background equally where the response was patchy for foreground and background across both slides (Fig. 5 and 6) .


Figure 4
View larger version (19K):
[in this window]
[in a new window]

 
Fig. 4. Spatial location of blanks in Slides 75 and 77 of Data Set 2.

 

Figure 5
Figure 5
Figure 5
Figure 5
View larger version (170K):
[in this window]
[in a new window]

 
Fig. 5. Interpolation of background blanks (a, b) and foreground blanks (c, d) for Slide 75 in two conditions from Data Set 2 [Condition A in (a, c) and Condition B in (b, d)].

 

Figure 6
Figure 6
Figure 6
Figure 6
View larger version (168K):
[in this window]
[in a new window]

 
Fig. 6. Interpolation of background blanks (a, b) and foreground blanks (c, d) for Slide 77 in two conditions from Data Set 2 [Condition A in (a, c) and Condition B in (b, d)].

 
The corrections in Data Set 2 were more extensive because gene responses were more severely distorted by spatial trends in the slides than in Data Set 1. In this case, R2 decreases from 0.207 to 0.031 for Slide 75 and from 0.105 to 0.082 for Slide 77. In all cases, the distribution of the log of the adjusted ratio (Fig. 7) is more variable after the spatial analysis than without SA. From a statistical perspective, the most important aspect of the data (once the bias due to systematic trends is eliminated) is that they be distributed as symmetrically as possible (although still not normally). A symmetric distribution allows using the ANOVA at the stage when candidate genes will be identified.


Figure 7
View larger version (24K):
[in this window]
[in a new window]

 
Fig. 7. MA graph of unadjusted (left) and spatial adjusted values (right) for two slides (75, 77) in one condition from Data Set 2. Values in the y axis are M = log2(S1/S2) and in the x axis, A = 0.5 x [log2(S1 x S2)].

 
For Data Set 3, the detection of genes with significant expression was performed on 6756 adjusted spots after the LA (and after eliminating the genes with negative signal) and on 6589 genes after the SA {and after eliminating all spots with a Dobs. below the threshold value given by [Badj. + 2(SE)]}.

Validating the Number of False Positive Genes in Data Sets 1 and 3
To check the effect of SA on Data Set 1, we compared the genes selected under the SA with those selected after GA of intensity values and under LA. A total of 6862 spots (out of 15496) were eliminated from the SA data because they had Sadj. < 0 due to high background values or very low foreground values. The selection stage was therefore performed on 8634 observations corresponding to 1775 genes. Using the GA of intensity values, 7430 genes were eliminated. With the LA, no genes were eliminated.

For SA Data Set 1, the null hypothesis of log(Sadj.1/Sadj.2) = 0 was rejected at a probability level of 0.001 (Table 2) for a total of 80 genes (19 of them with ratios greater than 1.5). Using the same probability level, 59 and 47 genes were selected by the GA and LA methods, respectively. Interestingly, SA and GA selected only 13 common genes, and SA, GA, and LA selected only two common genes. A total of 67 genes were selected using the SA method only, 46 genes were selected using the GA method (Table 3), and 43 genes were selected with the LA method.


View this table:
[in this window]
[in a new window]

 
Table 2. Ratio distribution of the selected genes for Data Set 1.

 

View this table:
[in this window]
[in a new window]

 
Table 3. Number of selected, tested, and false positive genes with significant signal selected under the proposed spatial adjustment (SA) and global adjustment (GA) for Data Set 1.

 
Genes selected after the LA had a ratio that ranged from 2.30 to 3.24; 40 genes were not included by the SA because [Dobs. < Badj. + 2(SE)] or Sadj. < 0. In the SA, genes with a negative signal must be analyzed separately. In general, genes with a lower signal in one channel show an extreme log of the ratio and become outliers in the ANOVA. In some cases, the signal in both channels is negligible, and the ratio is meaningless. The reason why the genes selected by the LA were different from those selected by the SA could be that the ratio intensity was not important in Data Set 1.

To evaluate the proportion of false positive genes obtained by the SA and GA methods, we randomly selected a subset of genes with a log ratio statistically different from zero and validated their responses using RT-PCR on the same RNA samples used to generate the microarray data. We tested 12 genes that were common to both normalization methods (SA and GA), 10 that were unique to SA data, and 15 that were unique to GA data. As shown in Table 3, the level of Type I error was higher without the SA. Considering all tested genes, the SA gave 23% false positive genes against 41% given by the GA. For GA, 9 out of 10 (90%) were false negative genes. For SA, 8 out of 15 were false negative genes (53%). These estimates may be biased upward because they are based on genes selected by the opposite method (SA vs. GA, and vice versa). Since the same method was applied in both SAs, the results likely represent a fair estimate of the relative efficiency of the SA and GA methods. The SA produces both fewer false positive and fewer false negative genes than the GA. The LA method fails to detect the 25 [(12 – 4) + (10 – 1) + (15 – 7)] genes with positive expression (RT-PCR) found by the SA and GA methods.

Data Set 3 is a microarray experiment that had the same sample marked with both dyes and hybridized in the two slides. Thus, for this experiment we expected that all the ratios should be equal to one except for those due to random chance. In general, results show that SA and the LA methods gave similar number of genes with false positive responses (Table 4). However, for P values smaller than 0.005, the SA gave less number of false positives than those found by the LA. Considering that the selected P value should be low to decrease the experiment-wise error, the SA gave slightly better results than the LA method. Results show that the percentage of false positives is greater than the selected P value, and it is more similar to the experiment-wise error. To obtain an experiment-wise error of 0.001 using the Bonferroni correction, the P value needs to be 0.0000015, which gives 6 and 14 false positives for the SA and the LA, respectively.


View this table:
[in this window]
[in a new window]

 
Table 4. Number (n) and percentage of genes with significant change in expression at different P value and experiment-wise error values. Values adjusted by spatial analysis (SA) and by the lowess normalization (LA) for Data Set 3.

 

    DISCUSSION
 TOP
 ABSTRACT
 INTRODUCTION
 MATERIALS AND METHODS
 RESULTS
 DISCUSSION
 REFERENCES
 
The method presented here differs from previous methods in that it allows for the simultaneous correction of the ratio intensity relationship, spatial effects, and other effects not considered in this work, such as print-pin, row, and column. The general applicability of the SA as presented in this study is dependent on the number and placement of blank spots within an array. The concept of this study of adjusting the gene expression obtained at each spot based on the performance of the blank surrounding the spot is reasonable assuming that the experimental error of foreground and background fluorescence of the gene spots approximated the same process as the blank spots.

Baird et al. (2004) found that the SA with splines and including all spots of the array accounted for more variability than that using the LA. Here we have used only the blank spots present in the array for adjusting the gene spots. We speculate that because the genes and their replicates are not randomized to the spots, the adjustment including all the spots will be biased by the spatial experimental variability. The effect of the genes is confounded with other effects such as print-pin, sub-row, and sub-column inside print-pin. For detecting genes with significant expression, Hoffmann et al. (2002) used three procedures: one parametric test, one nonparametric test (both computed in the context of the ANOVA, assuming a fixed effect model), and a permutation-based method. Here we have used the ANOVA assuming a mixed model but using spatially adjusted data.

The sequential approach applied by Wolfinger et al. (2001) and Jin et al. (2001) first corrects by ratio intensity relationship and then uses a model for selecting candidate genes. As pointed out by Sebastiani et al. (2003), using a linear model and assuming that the experiment had errors distributed along their expected value of zero, normalization is not necessary.

The spatial analysis applied in this study can be used in standard replicated experiments or in unreplicated experiments but with repeated controls or blanks, and it can be performed for one or two dyes, in conjunction with any other approach for detecting candidate genes. Here, we refer to unreplicated experiments as an experiment with one replicate per slide of each gene. Also, the autoregressive model in the direction of the rows and the columns can be applied to the individual print-pins rather than to the entire slide. Furthermore, results of the SA in each condition can be used as input for cluster analysis, ANOVA, principal components analysis, permutation test for detecting genes with significant expression, and any other statistical analysis across conditions.

The SA approach weighted neighbor spots by their correlation and does not need to define a block size, as suggested by Wilson et al. (2003), nor smooth parameters, as does the lowess method. Furthermore, one advantage of the spatial approach is that it uses only the value of the blanks and not the values of the genes themselves, so the value of the genes is not affected. We can also use housekeeping genes, which are not easy to select (Fang et al., 2003).

The question of adjusting the foreground and background separately and then calculating the signal (signal = foreground – background), as opposed to adjusting the signal directly, is debatable. Different spatial variability patterns could be founded in background and foreground; therefore, adjusting the two measurements separately would seem better than adjusting the signal. By considering only one variable, the spatial model is more parsimonious than if it considered the difference between two variables. However, in some microarray experiments, the foreground and background may have the same spatial pattern, and thus adjusting a spatial model on the signal alone will be simpler and more efficient. Separate adjustment of background and foreground might produce more negative signal values than an adjustment performed directly on the signal. However, those negative signal values can be considered as values of zero and deserved further study. An option could be to spatially adjust the ratio directly, but this might introduce artifacts contained in the reference data. Some authors have suggested that the correction for background adds noise to the experiment, but others use it for calculating the signal (Tseng et al., 2001; Kooperberg et al., 2002; Smyth et al., 2002; Chen-An et al., 2003). If the background is homogeneous across the entire microarray, there is no need to correct the foreground, but if the background is heterogeneous across the slide, correcting the foreground is necessary.

Further research is needed to study the results of adjusting foreground and background separately and then computing the signal, as compared with directly adjusting the signal. Another fruitful area of research may be to examine whether spot can be considered as a random effect and not as a fixed effect, as it is in this study (see Eq. [1]). Also, in this study we fitted the AR1 x AR1 model, but other models can be fitted that include the random effect of rows or columns; no one specific model is best for all experimental situations, and techniques for model selection are available (Gilmour et al., 1997; Burgueño et al., 2000). Furthermore, the autoregressive model can be adjusted to each of the print-pin and then candidate genes selected across print-pin.

It would be useful to investigate the effect of including all the terms in the initial model: dye, slide, print-pin, condition. This would adjust every spot in each dye and slide simultaneously and not separately in two stages, as in this study. A better spatial location of the blanks (such as allocating them in a diamond pattern) and an independent and separate position of the gene replicates would improve the efficiency of the spatial analysis by producing a better estimation of the variance–covariance matrix and a more balanced correction of the spot values. Close allocation of gene replicates would be equally affected by systematic trends, and therefore would eliminate the possibility of separating the gene effect from the spatial trends.

Received for publication June 24, 2004.


    REFERENCES
 TOP
 ABSTRACT
 INTRODUCTION
 MATERIALS AND METHODS
 RESULTS
 DISCUSSION
 REFERENCES
 





This Article
Right arrow Abstract Freely available
Right arrow Figures Only
Right arrow Full Text (PDF) Free
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Alert me to new issues of the journal
Right arrow Download to citation manager
Right arrow reprints & permissions
Citing Articles
Right arrow Citing Articles via ISI Web of Science (2)
Right arrow Citing Articles via Google Scholar
Google Scholar
Right arrow Articles by Burgueño, J.
Right arrow Articles by Autran, D.
Right arrow Search for Related Content
PubMed
Right arrow Articles by Burgueño, J.
Right arrow Articles by Autran, D.
Agricola
Right arrow Articles by Burgueño, J.
Right arrow Articles by Autran, D.
Related Collections
Right arrow Biometrics


HOME HELP FEEDBACK SUBSCRIPTIONS ARCHIVE SEARCH TABLE OF CONTENTS
The SCI Journals Agronomy Journal Vadose Zone Journal
Journal of Natural Resources
and Life Sciences Education
Soil Science Society of America Journal
Journal of Plant Registrations Journal of
Environmental Quality
The Plant Genome