Crop Science 40:1521-1528 (2000)
© 2000 Crop Science Society of America

REVIEW & INTERPRETATION

Software for Population Genetic Analyses of Molecular Marker Data

Joanne A. Labate

Institute for Genomic Diversity, Cornell Univ., Ithaca, NY 14853 USA

jl265{at}cornell.edu

ABSTRACT

Molecular genetic markers can be used to examine a group of individuals or populations to estimate various diversity measures and genetic distances, infer population structure and clustering patterns, test for Hardy-Weinberg and multilocus equilibrium, and test polymorphic loci for evidence of selective neutrality. This can be useful to plant breeders, germplasm managers, or others who are interested in population genetic properties of materials that they are working with. Many software programs for molecular population genetics studies have been developed for personal computers. Their easy access, implementation of sophisticated and powerful statistical techniques, and user-friendliness make them an attractive alternative to performing calculations on spreadsheets or by writing simpler programs for oneself. This review outlines the current major features of six popular programs (TFPGA, Arlequin, GDA, GENEPOP, GeneStrut, and POPGENE), including the types of data they handle, analyses they perform, and where they can be obtained (Table 1). These particular programs were chosen because each can accommodate a variety of molecular marker types and perform many different types of analyses. Although there is much overlap in their functionality, each program has unique features to offer potential users.


Table 1 Major features of reviewed software programs

 
MOLECULAR GENETIC MARKERS have become increasingly available in a variety of plant species and will likely continue to do so (Westman and Kresovich, 1997). The utility of molecular genetic markers extends beyond mapping and fingerprinting experiments into population genetics, where allele frequencies rather than individuals serve as the focus of study. Allele frequency variation within a crop species is of interest to breeders because it is the raw material on which selection acts. It is assumed that a set of random, anonymous DNA markers will be representative of the genome, including the loci that code for phenotypic traits under selection. Studying variation at marker loci allows genetic classification of populations and can lend insight into their history and inherited changes during their improvement.

Software with which to analyze intraspecific genetic variation within the framework of evolutionary hypothesis testing includes TFPGA (Miller, 1997), Arlequin (Schneider et al., 1997), GDA (Lewis and Zaykin, 1999), GENEPOP (Raymond and Rousset, 1995a), GeneStrut (Constantine et al., 1994), and POPGENE (Yeh and Boyle, 1997) (see Tables 1 and 2). The purpose of this paper is to describe briefly the attributes of these particular programs and to summarize the methods that each employs. The reviewed programs were chosen because they are broad in their application, i.e., they were designed for a variety of marker types and analyses. All are available for free on the Internet and can be run on Windows PCs and Apple Macintosh computers. There are many other good programs available and it was impossible to include them all. Additional links for evolutionary software and related resources are: http://wbar.uta.edu/software/software.htm (evolution and population genetics educational database; verified May 26, 2000), http://darwin.eeb.uconn.edu/evolution-sites.html#software (evolutionary biology software; verified May 26, 2000), and http://wwwvet.murdoch.edu.au/vetschl/imgad/GSLinks.htm (phylogenetic and population genetics links; verified May 26, 2000).


Table 2 Availability of reviewed software programs

 
Programs

Tools for Population Genetic Analyses (TFPGA) version 1.3 [Oct. 26, 1998]
Use of the Program; Data Types: Haploid or Diploid, Dominant or Codominant
Interactions are through menus and dialog boxes. The application window can be switched between data and results modes. Data can be imported as TFPGA-formatted text files, entered directly into the data window, or non-TFPGA formatted files can be opened and edited within the data window. Results are text based and can be modified within the results window. Graphic results are generated for UPGMA (a dendrogram) and the Mantel test (a scatterplot of matrix elements). The dendrogram can be saved as a bitmap file. The 30-page manual includes well-documented descriptions of methods, easy-to-follow instructions, explanations of minor flaws and common error codes, and four pages of references and suggested reading.

Analyses
The Analyze menu lists seven items: Descriptive Statistics, F-statistics, Genetic Distance, Hardy-Weinberg, UPGMA, Exact Tests, and Mantel Test.

The program will estimate allele frequencies for diploid dominant markers by one of two methods: as the square root of the frequency of the recessive genotype or by Taylor expansion (Lynch and Milligan, 1994).
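The two estimators named above can be sketched in a few lines. This is only an illustration, not TFPGA's code: the function names are hypothetical, and the Taylor-expansion form shown (q = sqrt(x) / [1 - Var(x) / (8x^2)], with Var(x) = x(1 - x)/N) is the expression usually attributed to Lynch and Milligan (1994); that TFPGA applies exactly this expression is an assumption.

```python
import math


def null_allele_freq_sqrt(n_null: int, n_total: int) -> float:
    """Recessive (null) allele frequency as the square root of the
    frequency of band-absent individuals. Assumes Hardy-Weinberg
    proportions and two alleles per locus."""
    if n_total <= 0 or not 0 <= n_null <= n_total:
        raise ValueError("need 0 <= n_null <= n_total and n_total > 0")
    return math.sqrt(n_null / n_total)


def null_allele_freq_taylor(n_null: int, n_total: int) -> float:
    """Square-root estimate with a second-order Taylor correction for
    the bias introduced by taking the square root of a sampled value
    (form assumed from Lynch and Milligan, 1994)."""
    if n_total <= 0 or not 0 <= n_null <= n_total:
        raise ValueError("need 0 <= n_null <= n_total and n_total > 0")
    x = n_null / n_total
    if x == 0:
        return 0.0  # no null homozygotes observed; the correction is undefined
    var_x = x * (1.0 - x) / n_total
    return math.sqrt(x) / (1.0 - var_x / (8.0 * x * x))
```

With 25 band-absent individuals out of 100, the first function returns 0.5 and the second a slightly larger value, which shows the direction of the correction.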

Descriptive statistics include allele and heterozygote (but not genotype) frequencies, observed and expected heterozygosities (Nei, 1978), and percent polymorphic loci, with the option to obtain these estimates for any level of the user-defined hierarchy. F-statistics (Weir and Cockerham, 1984) are reported for each allele, each locus, and over all loci. Options for f (equivalent to FIS, the inbreeding coefficient within a subpopulation), F (equivalent to FIT, the overall inbreeding coefficient), and θ (equivalent to FST, the fixation index) estimates include jackknifing over loci to obtain standard deviations and bootstrapping over loci to provide 95% confidence intervals. Up to four hierarchical levels can be included in the F-statistics (individuals, subsubpopulations, subpopulations, and populations). Chi-square tests, G-tests, and exact tests (Guo and Thompson, 1992) for Hardy-Weinberg equilibrium are available, and these tests may be applied to any level of the hierarchy. There is an option for pooling according to genotype (homozygous for the most common allele, heterozygous for the most common allele, and all other genotypes). Various genetic distances and identities can be calculated for any level of the hierarchy: Nei's and Nei's minimum (original and unbiased, Nei, 1972, 1978), Rogers' (1972) and modified Rogers' (Wright, 1978, p. 91), and coancestry (Reynolds et al., 1983). Clustering analysis is by UPGMA (Sokal and Michener, 1958); bootstrap values and consistency indices (the number of loci that support each node) can be generated. Exact tests for population differentiation (Raymond and Rousset, 1995b) in terms of allele frequencies are available, with the option to assign populations or other groups from the hierarchy to one of two test groups. Each locus is analyzed for differences in allele frequencies between populations, and Fisher's Combined Probability test (Sokal and Rohlf, 1981, p. 779–782) uses evidence from all loci to test the null hypothesis of no population differentiation.
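Fisher's combined probability test is simple enough to illustrate directly. The sketch below is not taken from TFPGA; the function name is hypothetical. It combines k independent per-locus P-values into the statistic -2 Σ ln(P), which is referred to a chi-square distribution with 2k degrees of freedom. Because the degrees of freedom are always even, the upper-tail probability has a closed form and no statistics library is needed.

```python
import math


def fisher_combined(p_values):
    """Return (statistic, combined P-value) for independent P-values.

    statistic = -2 * sum(ln p), compared with chi-square on 2k df.
    For even df, P(X > x) = exp(-x/2) * sum_{i<k} (x/2)^i / i!.
    """
    if not p_values or any(not 0.0 < p <= 1.0 for p in p_values):
        raise ValueError("P-values must lie in (0, 1]")
    k = len(p_values)
    statistic = -2.0 * sum(math.log(p) for p in p_values)
    half = statistic / 2.0
    term = 1.0   # (x/2)^0 / 0!
    total = 1.0
    for i in range(1, k):
        term *= half / i
        total += term
    return statistic, math.exp(-half) * total
```

Two loci that are each only borderline (P = 0.05) give a combined P-value below 0.05, which is the sense in which the test pools evidence across loci. The loci are assumed to be statistically independent.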

A Mantel test is available. This uses a permutation procedure to look for a significant association between two distance matrices (e.g., genetic distance and geographic distance). The program estimates a correlation coefficient between two matrices, and also reports a Z-statistic (Mantel, 1967). Options include the log transformation of matrix elements.
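The permutation logic of a Mantel test can be sketched as follows. This is a minimal illustration under stated assumptions, not TFPGA's implementation: the function names, the default of 999 permutations, and the one-sided (upper-tail) P-value are all choices made here. Only the Z-statistic (the sum of products of corresponding off-diagonal elements) is computed; rows and columns of the second matrix are permuted together so that its internal structure is preserved.

```python
import random


def mantel_z(a, b):
    """Mantel's Z: sum of products of corresponding lower-triangle elements."""
    n = len(a)
    return sum(a[i][j] * b[i][j] for i in range(n) for j in range(i))


def mantel_test(a, b, permutations=999, seed=0):
    """Return (observed Z, one-sided permutation P-value).

    a and b are square symmetric distance matrices (lists of lists)
    over the same populations, in the same order.
    """
    n = len(a)
    rng = random.Random(seed)
    observed = mantel_z(a, b)
    order = list(range(n))
    hits = 1  # the observed arrangement counts as one permutation
    for _ in range(permutations):
        rng.shuffle(order)
        # Permute rows and columns of b simultaneously.
        permuted = [[b[order[i]][order[j]] for j in range(n)] for i in range(n)]
        if mantel_z(a, permuted) >= observed:
            hits += 1
    return observed, hits / (permutations + 1)
```

A typical call passes a genetic distance matrix as `a` and a geographic distance matrix as `b`; a small P-value indicates that large genetic distances tend to coincide with large geographic distances.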

Comments
For diploid codominant data the two alleles at a locus must be sorted. For allele designations, the program assumes that the largest value equals the number of alleles, and that the lower-numbered alleles are also contained in the data set. The diploid, dominant marker type assumes that there are only two alleles at each locus; data are scored as presence versus absence of a band. This is a good program to consider for analyses of dominant markers and should satisfy most of the needs for analyzing codominant markers.

Arlequin version 1.1 [Dec. 17, 1997]
Use of the Program; Data Types: Haploid, Diploid with Known Gametic Phase, Diploid with Unknown Gametic Phase (with Recessive Alleles), Diploid with Unknown Gametic Phase (without Recessive Alleles), DNA Sequences, RFLP Haplotypes (Coded as Presence/Absence of Restriction Sites), Microsatellites, Allozymes, and Allele Frequency Data
Input data files are referred to as "project files." The user can create a project in any text editor or use the program's "project outline wizard," which opens a dialog box specifying essential project elements. Arlequin is able to convert data files to and from the formats Arlequin, GENEPOP (Raymond and Rousset, 1995a), BIOSYS (Swofford and Selander, 1981), PHYLIP (Felsenstein, 1993), MEGA (Kumar et al., 1993), and WinAmova (Excoffier et al., 1992).

Interactions are via the menu, the toolbar, or the launch pad, and dialog boxes. An input file containing settings (types of analyses and options concerning them) can be automatically created by selecting "save settings" within the application. These files are convenient for repeating analyses on various project files. More than one project file can be analyzed consecutively by running a batch file. This is simply a text file with one project file name per line. Different sets of analyses can be performed on different project files within a batch by using associated settings files. There are no interactive options to include/exclude loci or populations at runtime for analyses.

Results are deposited into a text file, or optionally, an HTML file that can be viewed with any web browser. Files formatted to be read by Microsoft Excel are output in two cases: for pairwise linkage disequilibria tests and for AMOVA (histograms of null distributions of variance components are generated).

The manual is 82 pages in length and includes highly detailed descriptions of methodology and four pages of references.

Analyses
Analyses are divided into four categories: Diversity Indices, Disequilibrium Tests, Neutrality Tests, and Population Structure.

Diversity indices include maximum-likelihood estimation of haplotype frequencies (Excoffier and Slatkin, 1995), maximum-likelihood estimation of allele frequencies with or without a recessive allele, expected allele frequencies under the infinite alleles model (Stewart, 1977), mean number of pairwise differences between haplotypes, and mismatch distribution (Rogers and Harpending, 1992). Other descriptive measures are number of haplotypes, observed and expected heterozygosity, observed homozygosity, number of polymorphic loci, number of alleles per locus, and allele frequencies. A distance matrix between all pairs of individuals can be generated. A wide variety of methods to estimate distances between molecular haplotypes, depending on marker type, are available (see Arlequin manual p. 55–61). Four different estimators of the population parameter θ = 4Nµ (where N equals effective population size and µ equals the neutral mutation rate) can be computed for nucleotide data. These are based on observed homozygosity (H, Chakraborty and Weiss, 1991), number of segregating sites (S, Watterson, 1975), number of alleles (k, Ewens, 1972), or mean number of pairwise differences (π, Tajima, 1983).
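Two of the four estimators of θ are easy to show on aligned sequences. The sketch below is illustrative only and does not reproduce Arlequin's code: function names are hypothetical, sequences are assumed to be aligned strings of equal length with no missing data, and both estimates are per sequence rather than per site. Watterson's estimator divides the number of segregating sites S by a_n = Σ 1/i (i = 1 to n - 1); Tajima's estimator is the mean number of pairwise differences.

```python
from itertools import combinations


def segregating_sites(seqs):
    """Number of alignment columns showing more than one state."""
    return sum(1 for column in zip(*seqs) if len(set(column)) > 1)


def theta_watterson(seqs):
    """Estimate of theta from the number of segregating sites (S)."""
    n = len(seqs)
    if n < 2:
        raise ValueError("need at least two sequences")
    a_n = sum(1.0 / i for i in range(1, n))
    return segregating_sites(seqs) / a_n


def theta_pi(seqs):
    """Estimate of theta as the mean number of pairwise differences."""
    if len(seqs) < 2:
        raise ValueError("need at least two sequences")
    pairs = list(combinations(seqs, 2))
    diffs = sum(sum(x != y for x, y in zip(s, t)) for s, t in pairs)
    return diffs / len(pairs)
```

Under neutrality both quantities estimate the same parameter, so a large difference between them is informative; this contrast is what Tajima's test evaluates.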

Pairwise linkage disequilibrium is measured by an exact test for haplotypic data (Slatkin, 1994a) or a likelihood ratio test for diploid data with unknown gametic phase (Slatkin and Excoffier, 1996). Classical measures of linkage disequilibria (Lewontin and Kojima, 1960; Lewontin, 1964) can also be calculated. Hardy-Weinberg exact tests are possible for diploid data with no recessive alleles. Genotypic data with known gametic phase can be tested at the haplotypic level (measuring the nonrandom association of haplotypes into individuals).
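For two biallelic loci with known gametic phase, the classical measures reduce to a few arithmetic steps. The sketch below is an illustration with a hypothetical function name, not Arlequin's code: D is the difference between the observed frequency of haplotype AB and its expectation under independence, and D' rescales D by the largest magnitude it could take given the allele frequencies.

```python
def linkage_disequilibrium(p_ab: float, p_a: float, p_b: float):
    """Return (D, D') for haplotype frequency p_ab and allele
    frequencies p_a (allele A at locus 1) and p_b (allele B at locus 2)."""
    d = p_ab - p_a * p_b
    if d >= 0:
        d_max = min(p_a * (1.0 - p_b), (1.0 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1.0 - p_a) * (1.0 - p_b))
    # If either locus is monomorphic, d_max is zero and D' is undefined;
    # zero is returned here as a convention of this sketch.
    d_prime = d / d_max if d_max > 0 else 0.0
    return d, d_prime
```

When alleles A and B always occur together and each has frequency 0.5, D is 0.25 and D' is 1; when the haplotype frequency equals the product of the allele frequencies, both are zero.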

Neutrality tests based on an infinite-alleles model (suitable for diploid or haploid data) include Ewens-Watterson (Ewens, 1972; Watterson, 1978), Ewens-Watterson-Slatkin (Slatkin 1994b, 1996), and Chakraborty's (Chakraborty, 1990) test. Tajima's test (Tajima, 1989) of the infinite-sites model can be performed for DNA sequence or RFLP haplotype data.

Population structure by AMOVA is based on an analysis of variance of gene frequencies, taking into account the number of mutational differences between molecular haplotypes (Excoffier et al., 1992; Michalakis and Excoffier, 1996). The user defines the hierarchical group membership of the sampled individuals and the AMOVA partitions the total variance of the sample into its components, e.g., within individuals, among individuals within populations, among populations within groups, and among groups. The user's manual (p. 67–73) provides detailed descriptions of the AMOVA tables for various data types (haplotypic, genotypic, one or more groups of populations, individual level tested or not) and their variance components. All types of data can be handled except recessive alleles. Fixation indices (Weir and Cockerham, 1984) can be estimated and their significance tested by permutation procedures.

Population pairwise FST values can be computed and will be output in the form of a matrix. Three additional matrices based on these results are also output: coancestry coefficients (Reynolds et al., 1983), linearized FST's (Slatkin, 1995), and M values (population size times the fraction of migrants per generation, Nm). Their significance can be tested by permutation.
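Two of these derived matrices are simple element-wise transformations of pairwise FST. The sketch below assumes the commonly cited forms, -ln(1 - FST) for the coancestry distance of Reynolds et al. (1983) and FST / (1 - FST) for Slatkin's (1995) linearized FST; the function names are hypothetical, and whether Arlequin applies additional corrections is not verified here.

```python
import math


def reynolds_coancestry_distance(fst: float) -> float:
    """Coancestry distance, assumed form: -ln(1 - FST)."""
    if not 0.0 <= fst < 1.0:
        raise ValueError("FST must lie in [0, 1)")
    return -math.log(1.0 - fst)


def slatkin_linearized_fst(fst: float) -> float:
    """Linearized FST, assumed form: FST / (1 - FST)."""
    if not 0.0 <= fst < 1.0:
        raise ValueError("FST must lie in [0, 1)")
    return fst / (1.0 - fst)


def transform_matrix(fst_matrix, transform):
    """Apply a transformation to every off-diagonal element."""
    n = len(fst_matrix)
    return [[0.0 if i == j else transform(fst_matrix[i][j]) for j in range(n)]
            for i in range(n)]
```

Both transformations are intended to make the distance grow approximately linearly with divergence time, which FST itself does not.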

Finally, an exact test of population differentiation can be performed by testing the null hypothesis of the random distribution of haplotypes or genotypes among populations (Raymond and Rousset, 1995b).

Comments
The interface seems unreasonably complex upon initial inspection, but that is because Arlequin is very ambitious in what it has to offer. This program is useful for implementing some of the less common types of analyses, especially those involving haplotypic data.

Genetic Data Analysis (GDA) version 1.0 d13 [June 10, 1999]
Use of the Program; Data Types: Haploid or Diploid, Codominant
NEXUS (Maddison et al., 1997) file format is used. A text editor of the user's choice can be opened within the program to create files or edit existing ones. Importing of BIOSYS (Swofford and Selander, 1981), GeneStrut (Constantine et al., 1994), and Weir (1990, Appendix C) formatted files is supported, as is exporting into these formats plus NEXUS, GeneStat-PC (Lewis and Whitkus, 1989), and SAS (SAS Institute, Inc., Cary, NC) formats. Populations and loci can be included and excluded from the analyses at runtime.

Analyses are either menu and dialog box or command line driven from within the application window. The program can be run in batch mode by reading a sequence of instructions in a NEXUS GDA block (e.g., which files to read, which analyses to perform, where to store the results).

Results will appear in text form in the main window; this output can be simultaneously logged to a file. A visual rendering of the hierarchical structure used in computing F-statistics can be produced. Dendrograms are drawn using ASCII or line-drawing characters.

Documentation explains how to run the program but contains few details and references concerning the methods.

Analyses
Analysis types are organized into Descriptive Statistics, F-statistics, Distances, and Disequilibrium.

Descriptive statistics include mean sample sizes, percent polymorphic loci, mean number of alleles per locus, mean number of alleles per polymorphic locus, observed and expected heterozygosities, detection of private alleles, and FIS estimates for loci or alleles.

F-statistics options include the reporting of ANOVA components (sums of squares, mean squares, and variance components). Additional F-statistics options permit reporting results for individual alleles, assuming Hardy-Weinberg proportions or estimating inbreeding coefficients, and obtaining confidence intervals by bootstrapping over loci. Bootstrapping options include outputting a list of replicate estimates for f, F, and θ that can be used to create a graphical representation of the distribution of results using another application such as Microsoft Excel. Jackknifing across populations to obtain variances of F-statistics estimates for individual loci requires a minimum of three populations. Genetic proximities can be estimated as identities or distances (Nei, 1972, 1978; Reynolds et al., 1983). Hardy-Weinberg and linkage disequilibrium are examined by exact tests (permutation), with options to infer missing data based on allele frequencies from the data at hand. Zygotic versus gametic disequilibrium can be distinguished in the tests because there is an option to break up or to preserve genotypes at loci. It is possible to test any number of combinations of loci (pairs, triplets, etc.), up to all loci considered jointly. Cluster analysis includes neighbor-joining (Saitou and Nei, 1987) and UPGMA (Sokal and Michener, 1958).

Comments
NEXUS format makes it easy to analyze subsets of data without altering the original data file. It is flexible and convenient in numerous other ways (Maddison et al., 1997) to an extent not found in the other programs reviewed here. GDA is elegant in its combining of a simple interface with statistical power and can satisfy most of the needs for those analyzing codominant markers.

GENEPOP version 3.1d [Mar. 1999]
Includes the programs LINKDOS (Garnier-Gere and Dillmann, 1992) and ISOLDE (Rousset, 1997).

Use of the Program; Data Types: Haploid or Diploid, Codominant
Input is through GENEPOP-formatted text files. The program lists analyses options in menu form on the DOS interface and the user selects from among these by typing a digit or a letter. Results do not appear on the screen but are output as text files. ISOLDE outputs a file that is formatted for plotting results (e.g., in Microsoft Excel) with the first column containing geographic distances and the second column containing FST estimates. Data files can also be exported into formats for FSTAT (Goudet, 1995), BIOSYS (Swofford and Selander, 1981), LINKDOS (Garnier-Gere and Dillmann, 1992), and one suitable for an ANOVA of heterozygosity (Weir, 1990, p. 124).

The manual is 31 pages in length and methods are thoroughly explained and well documented. The authors explicitly define the hypotheses being tested under the various options.

Analyses
The main menu is organized into three sections: Testing (Hardy-Weinberg, genotypic disequilibrium, and population differentiation), Estimating (Nm, allele frequencies, population structure and isolation-by-distance), and Ecumenicism (file conversion and various utilities).

P-values for Hardy-Weinberg equilibrium, linkage disequilibrium, and population differentiation are computed by exact tests, either by complete enumeration (Louis and Dempster, 1987) or by Markov chain methods (Guo and Thompson, 1992) to analyze contingency tables. The standard error of the P-value is reported for Markov chain methods. In addition to the standard Hardy-Weinberg test, the U-test for heterozygote excess or deficiency (Rousset and Raymond, 1995) is available. Global Hardy-Weinberg tests across loci or across populations are another option; global tests assume statistical independence of loci. Two estimates of FIS (Weir and Cockerham, 1984; Robertson and Hill, 1984) for each allele and over all alleles are included in Hardy-Weinberg results.
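To show what complete enumeration means for a Hardy-Weinberg exact test, the sketch below handles the simplest case, one locus with two alleles. It is an illustration with hypothetical function names and is not GENEPOP's algorithm, which handles more alleles and switches to Markov chain sampling for larger tables. Conditional on the allele counts, each possible number of heterozygotes has a known probability under Hardy-Weinberg equilibrium; the P-value is the summed probability of all outcomes that are no more probable than the one observed.

```python
from fractions import Fraction
from math import factorial


def hw_genotype_probability(n_aa: int, n_ab: int, n_bb: int) -> Fraction:
    """Probability of the genotype counts given the allele counts,
    under Hardy-Weinberg equilibrium (two alleles)."""
    n = n_aa + n_ab + n_bb
    n_a = 2 * n_aa + n_ab
    n_b = 2 * n_bb + n_ab
    numerator = factorial(n) * 2 ** n_ab * factorial(n_a) * factorial(n_b)
    denominator = (factorial(n_aa) * factorial(n_ab) * factorial(n_bb)
                   * factorial(2 * n))
    return Fraction(numerator, denominator)


def hw_exact_p(n_aa: int, n_ab: int, n_bb: int) -> float:
    """Exact P-value by enumerating every heterozygote count that is
    compatible with the observed allele counts."""
    n_a = 2 * n_aa + n_ab
    n_b = 2 * n_bb + n_ab
    observed = hw_genotype_probability(n_aa, n_ab, n_bb)
    p_value = Fraction(0)
    # Heterozygote counts share the parity of n_a and cannot exceed
    # the count of the rarer allele.
    for het in range(n_a % 2, min(n_a, n_b) + 1, 2):
        prob = hw_genotype_probability((n_a - het) // 2, het, (n_b - het) // 2)
        if prob <= observed:
            p_value += prob
    return float(p_value)
```

Exact fractions are used so that ties between equally probable outcomes are resolved without floating-point error. The number of possible tables grows rapidly with the number of alleles, which is why sampling methods such as that of Guo and Thompson (1992) become necessary.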

Linkage disequilibrium between pairs of loci can be tested for diploid or haploid data. A global test for each pair of loci across populations is also performed. Population differentiation can be tested at the gene level or the genotypic level over all populations or for all pairs of populations (Raymond and Rousset, 1995b; Goudet et al., 1996). GENEPOP also gives the option to analyze any kind of contingency table (with multiple rows and columns possible) stored in a file by computing an unbiased estimate of the exact P-value of the table (through the STRUC program, Raymond and Rousset, 1995b).

Estimation of population parameters includes a multilocus estimate of the effective number of migrants (Nm, Slatkin, 1985; Barton and Slatkin, 1986), allele frequencies, observed and expected genotype proportions, observed and expected number of homozygotes and heterozygotes, and FIS estimates for each allele. Genotypic matrices can be generated. An option under the Ecumenicism menu will compute maximum likelihood estimates of allele frequencies in the presence of a null allele (Dempster et al., 1977). This requires at least one homozygous null individual in the population sample.

F-statistics or Rho-statistics (which take allele size into account) can be calculated for all populations or all pairs of populations (Cockerham, 1973; Weir and Cockerham, 1984; Michalakis and Excoffier, 1996). The computed matrix of FST (or RST) values can be tested for isolation-by-distance by the ISOLDE program if the user provides a matrix of geographical distances between all pairs of populations. ISOLDE computes Mantel tests and also regresses FST estimates to geographical distances (Rousset, 1997).

Comments
Results files are often very lengthy, and it is cumbersome to extract the information of interest, to the extent that I had to write programs to do so. Linkage disequilibrium and population differentiation analyses output include contingency tables. For Rho-statistics, alleles can either be named by their size or the name of each allele can be associated with its size.

GeneStrut [Sept. 4, 1998]
Includes the programs UPGMA2 (Constantine, 1998), LD86, and DIPLOID (Weir, 1990, Appendix C).

Use of the Program; Data Types: Diploid, Codominant
Input is through GeneStrut-formatted text files. Interaction is through the keyboard in response to a brief series of prompts appearing in a console window. All analyses are automatically performed and output into text files. Genetic distance matrices are output into files formatted as input into the clustering program UPGMA2 (Constantine, 1998). Files can be generated for input into Weir's programs LD86 and DIPLOID (Weir, 1990, Appendix C). The manual is brief (10 pages). Methodology is explained and well documented.

Analyses
Diversity statistics include allele frequencies, mean and variance of the number of alleles per locus, percent polymorphic loci, mean and variance of observed heterozygosity, and mean and variance of expected heterozygosity. Hardy-Weinberg equilibrium is tested by Chi-square and G-tests, including tests with the least common alleles pooled. Observed and expected (with and without Levene's correction, Levene, 1949) genotypic frequencies are reported. FIS with and without the least common alleles pooled is also reported, with the variance given for the pooled estimate. Nei's (1978) genetic distance and identity with variances (Nei, 1987, p. 226) and Rogers' (1972) genetic distance and similarity matrices are generated.
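Levene's (1949) correction adjusts expected genotype counts for sampling without replacement from a finite sample. A minimal sketch follows; the function name and the representation of genotypes as pairs of allele labels are assumptions of this example, not GeneStrut's input format. With N individuals and allele counts n_i, the expected number of homozygotes ii is n_i(n_i - 1) / [2(2N - 1)] and the expected number of heterozygotes ij is n_i n_j / (2N - 1).

```python
from collections import Counter
from itertools import combinations_with_replacement


def levene_expected_counts(genotypes):
    """Expected genotype counts under Hardy-Weinberg equilibrium with
    Levene's small-sample correction.

    genotypes: list of (allele, allele) tuples, one per diploid individual.
    Returns a dict keyed by sorted allele pairs.
    """
    n = len(genotypes)
    if n == 0:
        raise ValueError("no genotypes supplied")
    allele_counts = Counter(a for g in genotypes for a in g)
    denom = 2 * n - 1
    expected = {}
    for a, b in combinations_with_replacement(sorted(allele_counts), 2):
        if a == b:
            expected[(a, b)] = allele_counts[a] * (allele_counts[a] - 1) / (2 * denom)
        else:
            expected[(a, b)] = allele_counts[a] * allele_counts[b] / denom
    return expected
```

The expected counts always sum to N, and they can be compared with observed counts in a Chi-square or G-test. Without the correction the expected heterozygote count would be 2N p_i p_j, which is slightly smaller.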

Population structure is examined in two ways. For example, say we have sampled a species of fish from several states across the USA. We can assign the hierarchical membership of individuals to ponds within towns, towns within counties, and counties within states. Diversity partitioning will be estimated (i) within and between groups at the first level of the hierarchy only (ponds), and (ii) between all levels. In the first case, for each locus and the mean of all loci, output is as follows: HO (total observed heterozygosity), HS (within-group expected heterozygosity), HT (total expected heterozygosity), and the F-statistics (FIS, FST, FIT). The deviation of FIS and FST from zero is tested by Chi-square. In the second case, for each locus and the mean of all loci, estimates are output for HT (total expected heterozygosity), and its components H1 (between individuals within ponds), D12 (between ponds within towns), D23 (between towns within counties), D34 (between counties within states), D4T (between states). The corresponding G-statistics, obtained by dividing each component by HT, are also given. These estimate the fraction of the total variation found at each level (Nei, 1987, p. 190–192).
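The partitioning idea is easiest to see with a single level of subdivision, where HT = HS + DST and GST = DST / HT. The sketch below is a simplified illustration, not GeneStrut's procedure: function names are hypothetical, populations are weighted equally, and no sample-size corrections are applied. The multilevel case described above repeats the same subtraction at each level of the hierarchy.

```python
def gene_diversity(freqs):
    """Expected heterozygosity, 1 - sum(p_i^2), for one set of allele frequencies."""
    return 1.0 - sum(p * p for p in freqs)


def g_statistics(pop_freqs):
    """Return (HS, HT, GST) for one locus.

    pop_freqs: list of allele-frequency lists, one per population, all
    listing the same alleles in the same order.
    """
    k = len(pop_freqs)
    n_alleles = len(pop_freqs[0])
    h_s = sum(gene_diversity(f) for f in pop_freqs) / k
    mean_freqs = [sum(f[i] for f in pop_freqs) / k for i in range(n_alleles)]
    h_t = gene_diversity(mean_freqs)
    g_st = (h_t - h_s) / h_t if h_t > 0 else 0.0
    return h_s, h_t, g_st
```

Two populations fixed for different alleles give GST = 1 (all diversity is between populations), whereas two populations with identical allele frequencies give GST = 0.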

Most of the analyses are performed for all levels of the hierarchy, successively pooling individuals at lower levels.

Comments
Macintosh users should consider this program first if they don't wish to run a PC-emulator. It will satisfy most of the needs for analyzing codominant markers. GeneStrut requires alleles to be numbered sequentially and will not generate an error message if this is violated.

POPGENE version 1.31 [Nov. 10, 1998]
Use of the Program; Data Types: Haploid or Diploid, Dominant, or Codominant
Input is through imported POPGENE-formatted text files or files created by the program's text editor. Analyses are through menus and dialog boxes. Results appear in an output window which can be saved as a text file or cut and pasted into another text editor. The manual is sketchy, methods are not explained in detail, and on-line help is presently not available.

Analyses
The main menu bar contains three items with which to initiate analyses: Co-Dominant, Dominant, and Quantitative (quantitative trait analysis cannot be implemented at this time).

Diversity statistics for haploid data include allele frequencies, number of alleles, effective number of alleles, percent polymorphic loci, expected heterozygosity, and the Shannon information index. Codominant diploid data includes these plus genotypic frequencies, and observed and expected homozygosity and heterozygosity. The program can estimate allele frequencies for dominant markers (Chong, Yang, and Yeh, 1994). The user can optionally specify an inbreeding coefficient (FIS value, e.g., previously obtained from codominant markers) in a dominant marker data set for a population that is not in Hardy-Weinberg equilibrium to be used when estimating allele frequencies.
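Two of the less familiar diversity statistics are shown below in their standard textbook forms; the function names are hypothetical and POPGENE's exact conventions (for example, the base of the logarithm in the Shannon index) are assumptions. The effective number of alleles is the reciprocal of the expected homozygosity, and the Shannon information index is -Σ p ln(p).

```python
import math


def effective_number_of_alleles(freqs):
    """Reciprocal of the sum of squared allele frequencies at a locus."""
    return 1.0 / sum(p * p for p in freqs)


def shannon_index(freqs):
    """Shannon information index, -sum(p * ln p), using natural logarithms.
    Alleles with zero frequency contribute nothing."""
    return -sum(p * math.log(p) for p in freqs if p > 0)
```

For k equally frequent alleles the effective number equals k and the Shannon index equals ln(k); both decline as allele frequencies become more uneven.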

Hardy-Weinberg equilibrium is examined by Chi-square and likelihood ratio tests by Levene's (1949) algorithm for computing expected genotypic frequencies. FIS estimates for each allele and over all alleles are reported. Two-locus linkage disequilibrium is tested by Chi-square for single populations and by two-locus analysis of population subdivision (D-statistics, Ohta 1982a, b) for multiple populations. Hardy-Weinberg and linkage disequilibria can also be tested by a method that treats the data as a two-allele system (most frequent allele plus "other") (Smouse and Neel, 1977; Smouse et al., 1983; Yang and Yeh, 1993).

Population structure is estimated by G-statistics for haploid data (Nei, 1987, p. 190–192) and F-statistics (Nei, 1987, p. 159–166) for diploid data. A multilocus method of examining population structure for haploid data is also available (Brown et al., 1980; Brown and Feldman, 1981). Gene flow (Nm) is estimated from GST or FST (Slatkin and Barton, 1989). Population homogeneity tests can be carried out by Chi-square and likelihood ratio tests on two-way contingency tables of allele frequencies across populations. Nei's genetic distances and identities (Nei, 1972, 1978) can be estimated between groups or populations, and a dendrogram can be generated based on a UPGMA analysis (Sokal and Michener, 1958) of the distance matrix.
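The conversion of FST (or GST) into an estimate of gene flow is usually based on the island-model relationship Nm = (1 - FST) / (4 FST). The short sketch below uses that form; that POPGENE applies exactly this expression, without further adjustment, is an assumption, and the function name is hypothetical.

```python
def nm_from_fst(fst: float) -> float:
    """Number of migrants per generation implied by FST under an
    island model at equilibrium: Nm = 0.25 * (1 - FST) / FST."""
    if not 0.0 < fst < 1.0:
        raise ValueError("FST must lie strictly between 0 and 1")
    return 0.25 * (1.0 - fst) / fst
```

An FST of 0.2 corresponds to Nm = 1, i.e., one migrant per generation. The estimate inherits all of the assumptions of the island model and should be read as an order-of-magnitude indication.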

The Ewens-Watterson test for neutrality (Ewens, 1972; Watterson, 1978) can be carried out on haploid or diploid data.

Comments
For input files, there are no identifiers for individual samples, which makes editing the files confusing for large data sets. Before starting analyses the user is prompted to include or exclude loci and populations, and define groups of populations. These are not retained and must be specified before each run.

Test of the programs

I tested all six programs using a published (Weir, 1990, p. 338–339) data set of 44 individuals sampled from six populations and genotyped for five loci using diploid codominant markers. The purpose of the test was to assess whether or not the programs yielded the same or similar results rather than to detect software bugs or erroneous computations, which the authors have done extensively themselves and through user feedback.

Of the various types of analyses, diversity measures were the most congruent between programs. A potential cause of confusion in all types of analyses is the use of different terms for the same measure, and the converse of this. For example, comparing TFPGA and POPGENE, there is perfect correspondence between the former's "heterozygosity" with the latter's "Nei's 1973 expected heterozygosity;" "heterozygosity (unbiased)" with "expected heterozygosity (Levene, 1949);" and "heterozygosity (direct count)" with "observed heterozygosity." Arlequin's output for expected heterozygosity was very much inflated compared with estimates from the other programs. The manual stated that gene diversity is equivalent to expected heterozygosity for diploid data and further explained that "it is defined as the probability that two randomly chosen haplotypes are different in the sample." Gene diversity is usually computed on a per locus basis rather than entire haplotypes (although this is perfectly valid for some purposes); this explained the discrepancy between programs.
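The three heterozygosity measures that go by different names in TFPGA and POPGENE can be computed side by side for one locus. The sketch below is an illustration only: the function name, the dictionary keys, and the representation of genotypes as pairs of allele labels are assumptions of this example. The uncorrected value is 1 - Σ p², the unbiased value multiplies it by 2N / (2N - 1), and the direct count is the observed proportion of heterozygous individuals.

```python
from collections import Counter


def heterozygosity_measures(genotypes):
    """Return expected, unbiased expected, and observed heterozygosity
    for one locus.

    genotypes: list of (allele, allele) tuples, one per diploid individual.
    """
    n = len(genotypes)
    if n == 0:
        raise ValueError("no genotypes supplied")
    counts = Counter(a for g in genotypes for a in g)
    sum_p2 = sum((c / (2 * n)) ** 2 for c in counts.values())
    expected = 1.0 - sum_p2
    unbiased = expected * (2 * n) / (2 * n - 1)
    observed = sum(1 for a, b in genotypes if a != b) / n
    return {"expected": expected, "unbiased": unbiased, "observed": observed}
```

For small samples the unbiased value is noticeably larger than the uncorrected one, so it matters which of the two a program reports under the label "expected heterozygosity."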

Hardy-Weinberg tests gave very similar results between programs. The greatest disparities depended on the type of significance test performed. The answer to whether or not the null hypothesis should be rejected varied not only between programs but also within a program. For example, for Hardy-Weinberg equilibrium POPGENE outputs Chi-square and likelihood ratio (G-test) results, and GeneStrut gives identical results but goes one step further, reporting results with and without the least common alleles pooled. Any one or more of these tests can be significant for a particular locus. The exact tests for Hardy-Weinberg equilibrium by Arlequin, GENEPOP, GDA, and TFPGA were generally more conservative than traditional Chi-square and G-tests. A particular P-value was often marginally significant (P < 0.10) using an exact test when the corresponding Chi-square test was significant (P < 0.05). Results for exact tests between programs were closely identical to the nearest 0.01.

Exact tests for linkage disequilibrium between programs were much more variable. Occasionally a statistically significant result in one program was not even marginally significant in another, possibly because iterative procedures were converging on different solutions. This may have been rectified by different settings in the interface but there were no warnings or indications in the output files (e.g., standard error greater than 0.01 for an associated P-value in GENEPOP or Arlequin) that analyses should be rerun with different settings.

Analyses of population structure also gave a variety of results between programs. GENEPOP, TFPGA, and GDA gave equivalent results to the nearest 0.01 for F-statistics, including 95% confidence intervals. Results from other programs differed from these but were in general agreement, although any particular value did not always fall within the 95% confidence intervals of the estimates of GENEPOP, TFPGA, and GDA. This was a consequence of different methodologies.

Genetic distances agreed between programs when the same method was used and were frequently identical to the nearest 0.01, but results from clustering analyses based on the same distance matrices varied. This is not unexpected as the inferred tree was not unique.
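One common reason an inferred tree is not unique is a tie in the distance matrix: when two or more pairs share the minimum distance, an agglomerative method such as UPGMA must pick one of them first, and different programs may break the tie differently. The sketch below only detects such ties; the helper name and the distances are hypothetical, and ties are offered as one possible cause rather than the documented cause of the differences observed here.

```python
def tied_minimum_pairs(dist, tol=1e-9):
    """Return every pair whose distance equals the matrix minimum (within tol)."""
    smallest = min(dist.values())
    return sorted(pair for pair, d in dist.items() if abs(d - smallest) <= tol)


# Hypothetical pairwise distances among three populations, rounded to 0.01.
dist = {("A", "B"): 0.20, ("A", "C"): 0.30, ("B", "C"): 0.20}

candidates = tied_minimum_pairs(dist)
print(candidates)  # more than one entry means the first merge is ambiguous
```

With these distances both (A, B) and (B, C) are equally valid first merges, leading to two different topologies from the same matrix.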

Overall, there were enough differences between outputs of the six programs to warrant caution when comparing results of any type of analysis between published studies. Even if the method is cited as identical, its implementation can vary enough in its details to give quite different outcomes. Misinterpretations within or between data sets will certainly arise in the absence of a high degree of scrutiny of the findings. Results from exact tests should be compared with results from traditional parametric methods whenever possible.

Conclusions

In general, these programs grew out of an individual's or a lab's immediate research needs and were developed into user-friendly software to share with the larger research community. The developers have expended a great deal of time and effort to create, distribute, and regularly improve the software at no cost to the users. The most sensible approach for a user is to test whatever programs seem appropriate and decide which ones they prefer. This can be done efficiently by running a given package's sample input files. The Windows- and DOS-based programs all performed very well on a PowerMac with a PC emulator (Softwindows95), although, in general, analyses involving exact tests are impractical under emulation because they are too time consuming. For such analyses, an authentic Windows or DOS environment is highly preferable.

Frequently, more time is required to create properly formatted input files than to perform any particular analysis on a set of data. Some of the programs may be more attractive than others on this basis (e.g., TFPGA and GeneStrut require alleles to be numbered sequentially). A good manual is critical to the user and can be as valuable a resource as the software itself. It will ideally contain a table of contents and index, be thoroughly referenced, and clearly explain tests and hypotheses. Unfortunately, most of the manuals did not meet these standards. There are many details concerning the interface that contribute to ease of use, such as a warning message when exiting the program, and the ability to save certain settings and to run batch files.

ACKNOWLEDGMENTS

The author gratefully acknowledges Dr. Peter Bretting for his suggestion to undertake this review and Dr. Matthias Frisch for helping to improve the manuscript.

Received for publication December 14, 1999.

REFERENCES



