BASIC POPULATION GENETICS
M.Tevfik Dorak, M.D., Ph.D.
G.H. Hardy (the English
mathematician) and W. Weinberg (the German physician) independently worked out
the mathematical basis of population genetics in 1908 (Hardy, 1908). Their formula predicts
the expected genotype frequencies using the allele frequencies in a diploid
Mendelian population. They were concerned with questions like "what
happens to the frequencies of alleles in a population over time?" and
"would you expect to see alleles disappear or become more frequent over
time?"
Hardy and Weinberg showed
in the following manner that if the population is very large and random mating
is taking place, allele frequencies remain unchanged (or in equilibrium) over
time unless some other factors intervene. If the frequencies of alleles A and a
(of a biallelic locus) are p and q, then $p + q = 1$. This means $(p + q)^2 = 1$
too. It is also correct that $(p + q)^2 = p^2 + 2pq + q^2 = 1$. In this formula,
$p^2$ corresponds to the frequency of the homozygous genotype AA, $q^2$ to aa,
and $2pq$ to Aa. Since AA, Aa and aa are the three possible genotypes for a
biallelic locus, the sum of their frequencies should be 1. In summary, the
Hardy-Weinberg formula shows that:
$$\underbrace{p^2}_{AA} + \underbrace{2pq}_{Aa} + \underbrace{q^2}_{aa} = 1$$
If the observed genotype frequencies
do not differ significantly from these expected frequencies, the
population is said to be in Hardy-Weinberg equilibrium (HWE). If they do, at
least one of the following assumptions of the formula is violated, and the
population is not in HWE.
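The comparison of observed and expected genotype frequencies described above is commonly done with a chi-square goodness-of-fit test. The following is a minimal sketch, assuming a biallelic locus and genotype counts supplied by the user; the function name `hwe_chi_square` is hypothetical, and the use of one degree of freedom (three genotype classes, minus one, minus one estimated allele frequency) is the usual convention rather than something stated in the text.

```python
import math

def hwe_chi_square(n_AA, n_Aa, n_aa):
    """Chi-square goodness-of-fit test of genotype counts against HWE.

    Returns (frequency of allele A, chi-square statistic, p-value).
    """
    n = n_AA + n_Aa + n_aa
    p = (2 * n_AA + n_Aa) / (2 * n)      # allele A frequency by gene counting
    q = 1.0 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_AA, n_Aa, n_aa)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Assumption: 1 degree of freedom. For 1 df the chi-square
    # survival function equals erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return p, chi2, p_value

# Hypothetical counts: a strong heterozygote deficit.
p, chi2, p_value = hwe_chi_square(50, 0, 50)
print(f"p = {p:.2f}, chi-square = {chi2:.1f}, P = {p_value:.3g}")
```

For small samples or rare alleles an exact test is generally preferred to the chi-square approximation.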
The assumptions of HWE
1. Population size is
effectively infinite,
2. Mating is random in the
population (the most common deviation results from inbreeding),
3. Males and females have
similar allele frequencies,
4. There are no mutations
and migrations affecting the allele frequencies in the population,
5. The genotypes have equal
fitness, i.e., there is no selection (in viability and fitness).
The Hardy-Weinberg law
suggests that as long as the assumptions are valid, allele and genotype
frequencies will not change in a population in successive generations. Thus,
any deviation from HWE may indicate the
following biological processes:
1. Small population size
results in random sampling errors and unpredictable genotype frequencies (a
real population's size is always finite and the frequency of an allele may
fluctuate from generation to generation due to chance events),
2. Assortative mating which
may be positive (increases homozygosity; self-fertilization is an extreme
example) or negative (increases heterozygosity), or inbreeding which increases
homozygosity in the whole genome without changing the allele frequencies.
Rare-male mating advantage also tends to increase the frequency of the rare
allele and heterozygosity for it (in reality, random mating does not occur all
the time). Cryptic population stratification is another reason for departure
from HWE.
3. A very high mutation
rate in the population (typical mutation rates are < $10^{-5}$ per
generation) or massive migration from a genotypically different population
interfering with the allele frequencies,
4. Selection of one or a
combination of genotypes (selection may be negative or positive). Selective
elimination of homozygotes as in some autosomal dominant diseases, where
homozygotes for the mutation may die in utero, is an example (in a very large
sample, this could violate HWE). Similarly, sampling error
(selection bias) may also produce a departure from HWE, for example if the bias concerns ethnicity.
5. Unequal transmission
ratio (transmission ratio distortion or segregation distortion) of alternative
alleles from parents to offspring (as in mouse t-haplotypes),
6. Differential gene
frequency among males and females.
In most population genetic
estimations (like linkage disequilibrium calculations), HWE is assumed. This
means that genotype probabilities are determined by allele frequencies alone,
i.e., there is no transmission ratio distortion, selection against a genotype
(lethality) etc. If this assumption is not met, the estimations will not be
accurate: statistical methods using allele frequencies may not be valid
and methods that use genotype frequencies should be preferred (Xu, 2002) (discussed below). See the HWE
simulation in EvoTutor
for the effects of selection, mutation,
migration, genetic drift and assortative mating.
The implications of the HWE
1. The allele frequencies
remain constant from generation to generation. This means that hereditary
mechanism itself does not change allele frequencies. It is possible for one or
more assumptions of the equilibrium to be violated and still not produce
deviations from the expected frequencies that are large enough to be detected
by the goodness of fit test,
2. When an allele is rare,
there are many more heterozygotes than homozygotes for it. Thus, rare alleles
will be impossible to eliminate even if there is selection against homozygosity
for them,
3. For populations in HWE,
the proportion of heterozygotes is maximal when allele frequencies are equal
(p = q = 0.50), and when this happens the heterozygote frequency will be 0.50
(2 × 0.50 × 0.50). Unless HWE is violated (as in selective loss of homozygotes),
heterozygosity can never be more than 0.50 at any biallelic locus (see Figure in Prof Whitloc's Review).
4. An application of HWE is
that when the frequency of an autosomal recessive disease (e.g., sickle cell
disease, hereditary hemochromatosis, congenital adrenal hyperplasia) is known
in a population and unless there is reason to believe HWE does not hold in that
population, the gene frequency of the disease gene can be calculated. See also Clinical
Genetics. Likewise, the carrier rate may be calculated
for autosomal recessive disorders if the disease gene frequency is known. For
example, phenylketonuria (PKU) occurs in 1/11,000 ($q^2$), which gives a heterozygote carrier frequency
of approximately 1/50 [$2q(1-q)$]. If the diseased individuals ($q^2$) are deducted from the whole population, the
carrier rate in normal individuals is $2q/(1+q)$.
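The PKU arithmetic above can be reproduced in a few lines. This is a small sketch assuming HWE holds; the function name `carrier_frequencies` is hypothetical and only the 1/11,000 incidence comes from the text.

```python
import math

def carrier_frequencies(disease_frequency):
    """Allele and carrier frequencies for an autosomal recessive disease,
    assuming Hardy-Weinberg proportions (disease_frequency = q^2)."""
    q = math.sqrt(disease_frequency)        # disease allele frequency
    carriers_all = 2 * q * (1 - q)          # 2pq in the whole population
    carriers_unaffected = 2 * q / (1 + q)   # 2pq / (1 - q^2), unaffected only
    return q, carriers_all, carriers_unaffected

q, het, het_unaffected = carrier_frequencies(1 / 11000)
print(f"q = {q:.5f}")
print(f"carriers (whole population): 1 in {1 / het:.0f}")
print(f"carriers (unaffected only):  1 in {1 / het_unaffected:.0f}")
```

Because affected individuals are so rare, the two carrier rates are almost identical; the distinction only matters for common recessive conditions.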
It has to be remembered
that when HWE is tested, mathematical thinking is necessary. When the
population is found in equilibrium, it does not necessarily mean that all
assumptions are valid since there may be counterbalancing forces. Similarly, a
significant deviation may be due to sampling errors (including the Wahlund effect,
see below and Glossary), misclassification of genotypes, measuring two
or more systems as a single system, failure to detect rare alleles and the
inclusion of non-existent alleles. The Hardy-Weinberg law rarely holds true in
nature (otherwise evolution would not occur). Organisms are subject to
mutations and selective forces, they move about, and the allele frequencies may
be different in males and females. The gene frequencies are constantly changing
in a population, but the effects of these processes can be assessed by using
the Hardy-Weinberg law as the starting point.
The direction of departure
of observed from expected frequency cannot be used to infer the type of
selection acting on the locus even if it is known that selection is acting. If
selection is operating, the frequency of each genotype in the next generation
will be determined by its relative fitness (W). Relative fitness is a measure
of the relative contribution that a genotype makes to the next generation. It
can be measured in terms of the intensity of selection (s), where $W = 1 - s$
[$0 \le s \le 1$]. The frequencies of each genotype after selection
will be proportional to $p^2 W_{AA}$, $2pq\,W_{Aa}$, and $q^2 W_{aa}$.
The highest fitness is always 1 and the others are estimated proportional to
this. For example, in the case of heterozygote advantage (or overdominance),
the fitness of the heterozygous genotype (Aa) is 1, and the fitnesses of the
homozygous genotypes negatively selected are $W_{AA} = 1 - s_{AA}$
and $W_{aa} = 1 - s_{aa}$. It can be shown mathematically that
only in this case a stable polymorphism is possible. Other selection forms,
underdominance and directional selection, result in unstable polymorphisms. The
weighted average of the fitnesses of all genotypes is the mean fitness. It is
important that genetic fitness is determined by both fertility and viability.
This means that diseases that are fatal to the bearer but do not reduce the number
of progeny are not genetic lethals and do not have reduced fitness (like the
adult onset genetic diseases: Huntington's chorea, hereditary hemochromatosis).
The detection of selection is not easy because the impact on changes in allele
frequency occurs very slowly and selective forces are not static (may even vary
in one generation as in antagonistic pleiotropy).
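The effect of relative fitness on allele frequencies can be illustrated by iterating the standard one-locus selection recursion, in which the genotype frequencies after selection are divided by the mean fitness. This is a sketch under assumed, hypothetical fitness values (s_AA = 0.1, s_aa = 0.3); the function names are illustrative. With heterozygote advantage the frequency of A settles at s_aa / (s_AA + s_aa) regardless of where it starts, which is what a stable polymorphism means.

```python
def next_allele_frequency(p, w_AA, w_Aa, w_aa):
    """One generation of viability selection at a biallelic locus."""
    q = 1.0 - p
    mean_w = p * p * w_AA + 2 * p * q * w_Aa + q * q * w_aa  # mean fitness
    # Allele A is carried by all AA survivors and half of the Aa survivors.
    return (p * p * w_AA + p * q * w_Aa) / mean_w

def iterate_selection(p0, w_AA, w_Aa, w_aa, generations):
    """Frequency of allele A after a number of generations of selection."""
    p = p0
    for _ in range(generations):
        p = next_allele_frequency(p, w_AA, w_Aa, w_aa)
    return p

# Hypothetical heterozygote advantage: W_AA = 0.9, W_Aa = 1, W_aa = 0.7.
for start in (0.1, 0.9):
    p_end = iterate_selection(start, 0.9, 1.0, 0.7, 1000)
    print(f"start p = {start}: p after 1000 generations = {p_end:.4f}")
```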
All discussions presented
so far concern a simple biallelic locus. In real life, however, there are many
loci which are multiallelic and interact with each other as well as with
the environmental factors. The Hardy-Weinberg principle is equally applicable
to multiallelic loci but the mathematics is slightly more complicated. For
multigenic and multifactorial traits, which are mathematically continuous as
opposed to discrete, more complex techniques of quantitative genetics are
required.
In a final note on the
practical use of HWE, it has to be emphasized that its violation in daily life is most
frequently due to genotyping errors (Gomes, 1999;
Lewis, 2002).
Allelic misassignments, as frequently happen when the PCR-SSP method is used, sometimes due to allelic dropout (increased homozygosity), are the most
frequent causes of Hardy-Weinberg disequilibrium. When this is observed, the
genotyping protocol should be reviewed. Another common reason is the presence
of an unknown allele which is not considered in the genotyping scheme (null
allele). This happens when a variant is considered to be bi-allelic while it is
actually multiallelic. In a case-control association study, it is of paramount
importance that the control group is in HWE to rule out any technical errors
and to avoid false-positive associations (Tiret, 1995; Schaid & Jacobsen, 1999). The
violation of HWE in the case group, however, may be due to a real association (Nielsen, 1998;
Lee, 2003; Czika & Weir, 2004). When
Hardy-Weinberg disequilibrium is not due to a technical error, the statistical
evaluation of the data should involve statistical tests using genotype
frequencies rather than allele frequencies (Xu, 2002; Lewis, 2002).
One such test is the trend test (to estimate the common odds ratio), which can be run
online for SNP data (Online HWE
& Association Testing). Departure from HWE proportions can bias
haplotype frequency estimates from population data. It may
be a substantial source of error in Expectation-Maximization
haplotype frequency estimation, simply because the algorithm relies
on HWE in its 'expectation' step.
See Testing for Consistency with HW Proportions
at GENESTAT
Statistical Genetics Pages; PEDSTATS HWE Tutorial; HWE Slide Show.
Some concepts relevant to HWE
Wahlund effect: Reduction in observed heterozygosity (increased
homozygosity) because of pooling discrete subpopulations with different allele
frequencies that do not interbreed as a single randomly mating unit. When all
subpopulations have the same gene frequencies, no variance among subpopulations
exists, and no Wahlund effect occurs (FST = 0).
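The Wahlund effect can be shown numerically. The sketch below assumes each subpopulation is itself in HWE and that subpopulations are pooled with given weights (equal by default); the function name `wahlund` and the example frequencies are hypothetical. FST is computed here as (H_T - H_S) / H_T, where H_S is the heterozygosity actually present in the pooled sample and H_T is what HWE would predict from the pooled allele frequency.

```python
def wahlund(freqs, weights=None):
    """Heterozygote deficit caused by pooling subpopulations.

    freqs: frequency of allele A in each subpopulation.
    Returns (H_S, H_T, F_ST).
    """
    k = len(freqs)
    if weights is None:
        weights = [1.0 / k] * k            # assumption: equal-sized samples
    p_bar = sum(w * p for w, p in zip(weights, freqs))
    # Average within-subpopulation heterozygosity (observed in the pool).
    h_s = sum(w * 2 * p * (1 - p) for w, p in zip(weights, freqs))
    # Heterozygosity expected if the pool were one random-mating unit.
    h_t = 2 * p_bar * (1 - p_bar)
    f_st = (h_t - h_s) / h_t if h_t > 0 else 0.0
    return h_s, h_t, f_st

print(wahlund([0.2, 0.8]))   # different allele frequencies: deficit
print(wahlund([0.3, 0.3]))   # identical allele frequencies: no deficit
```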
F statistics: The F statistics in population genetics have
nothing to do with the F statistic used to evaluate differences in variances. Here F
stands for fixation index, fixation being increased homozygosity resulting from
inbreeding. Population subdivision results in the loss of genetic variation
(measured by heterozygosity) within subpopulations due to their being small
populations and genetic drift acting within each one of them. This means that
population subdivision would result in decreased heterozygosity relative to
that expected heterozygosity under random mating as if the whole population was
a single breeding unit. Wright developed three fixation indices to evaluate
population subdivision: FIS (interindividual), FST
(subpopulations), FIT (total population).
FIS is a measure of the deviation of genotypic
frequencies from panmictic frequencies in terms of heterozygous deficiency or
excess. It is what is known as
FIT is rarely used. It is the overall inbreeding
coefficient (F) of an individual relative to the total population
(Individual within the Total population).
See also a Lecture Note. Analysis of Molecular
Variance (AMOVA) is a method of estimating population differentiation directly
from molecular data (see also the Genetic Epidemiology Glossary).
Detecting Selection Using DNA Polymorphism Data
Several methods have
been designed to use DNA polymorphism data (sequences and allele frequencies)
to obtain information on past selection events. Most commonly, the ratio of non-synonymous (replacement) to
synonymous (silent) substitutions (dN/dS ratio; see below) is used as evidence for
overdominant selection (balancing selection) of which one form is heterozygote
advantage. Classic examples of this are the mammalian MHC
genes and compatibility
systems in other organisms: the self-incompatibility
system of plants, fungal mating types and invertebrate allorecognition
systems. In all these genes, a very high number of alleles is also noted. This
can be interpreted as an indicator of some form of balancing (diversifying)
selection. In the case of neutral polymorphism, one common allele and a few
rare alleles are expected. The frequency distribution of alleles is also
informative. A large number of alleles showing a relatively even distribution is
against neutrality expectations and suggestive of diversifying selection.
Most tests detect selection by rejecting
the neutrality assumption (the observed data deviate significantly from what is
expected under neutrality). This deviation, however, may also be due to other
factors such as changes in population size or genetic drift. The original
neutrality test was the Ewens-Watterson homozygosity test of
neutrality (see Glossary)
based on the comparison of observed homozygosity and predicted value calculated
by Ewens's sampling formula, which uses
the number of alleles and sample size. This test is not very powerful.
Other commonly used statistical tests of
neutrality are Tajima's D (theta), Fu & Li's D, D* and F. Tajima's test (Tajima, 1989) is based on the fact
that under the neutral model estimates of the number of segregating/polymorphic
sites and of the average number of nucleotide differences are correlated. If the
value of D is too large or too small, the neutral 'null' hypothesis is
rejected. DnaSP
calculates D and its confidence limits (two-tailed test). Tajima did not
base this test on the coalescent, but Fu and Li's tests (Fu & Li, 1993) are directly based
on the coalescent. The test statistics D and F require data from intraspecific
polymorphism and from an outgroup (a sequence from a related species), whereas D*
and F* only require intraspecific data. DnaSP uses the critical values
obtained by Fu & Li, 1993 to determine the
statistical significance of the D, F, D* and F* test statistics. DnaSP
can also compute the Fs test statistic (Fu, 1997). The results of this group
of tests (Tajima's D and Fu & Li's tests), based on allelic variation and/or
level of variability, may not clearly distinguish between selection and
demographic alternatives (bottleneck, population subdivision), but this problem
only applies to the analysis of a single locus: demographic changes affect all
loci whereas selection is expected to be locus-specific, so the two are
distinguishable if multiple loci are analyzed. Tests for multiple loci include
the HKA test described by Hudson et al (1987). This test is based on the idea
that in the absence of selection, the expected number of polymorphic
(segregating) sites within species and the expected number of 'fixed'
differences between species (divergence) are both proportional to the mutation
rate, and the ratio of them should be the same for all loci. Variation in the
ratio of divergence to polymorphism among loci suggests selection.
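To make Tajima's test concrete, the sketch below computes D from the number of sequences, the number of segregating sites and the average number of pairwise nucleotide differences, using the constants from Tajima (1989). The function names and the toy alignment are hypothetical, the helper assumes aligned sequences of equal length without gaps, and no significance test is attempted (DnaSP, mentioned above, supplies confidence limits).

```python
import math
from itertools import combinations

def tajimas_d(n, num_segregating, pi):
    """Tajima's D from sample size n, segregating sites S and
    mean pairwise differences pi. Undefined when S = 0."""
    if num_segregating == 0:
        raise ValueError("Tajima's D is undefined without segregating sites")
    s = num_segregating
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    # Difference between the two estimators of theta, scaled by its SD.
    return (pi - s / a1) / math.sqrt(e1 * s + e2 * s * (s - 1))

def summarize_alignment(seqs):
    """Return (n, S, pi) for equal-length aligned sequences."""
    n = len(seqs)
    s = sum(1 for column in zip(*seqs) if len(set(column)) > 1)
    pairs = list(combinations(seqs, 2))
    total = sum(sum(x != y for x, y in zip(a, b)) for a, b in pairs)
    return n, s, total / len(pairs)

# Hypothetical toy alignment of four sequences.
n, s, pi = summarize_alignment(["AAAA", "AAAT", "AATT", "ATTT"])
print(n, s, round(pi, 3), round(tajimas_d(n, s, pi), 3))
```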
A different group of neutrality tests that are
not sensitive to demographic changes includes the McDonald-Kreitman test (McDonald & Kreitman, 1991) and the dN/dS ratio test. The McDonald-Kreitman test compares the ratio of the
number of nonsynonymous to synonymous 'polymorphisms' within species with the
ratio of the number of nonsynonymous to synonymous 'fixed' differences between
species in a 2x2 table. The most direct method of showing the presence of
positive selection is to compare the number of nonsynonymous (dN) to the number of synonymous (dS)
substitutions in a locus. A high (>1) value of dN/dS
suggests fixation of nonsynonymous mutations with a higher probability than
neutral (synonymous) ones. Statistical properties of this test are given by Goldman & Yang, 1994 and by Muse & Gaut, 1994. The dN/dS
ratio tests take into account transition/transversion rate bias and codon
usage bias.
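The 2x2 comparison in the McDonald-Kreitman test can be evaluated with Fisher's exact test, as sketched below. The choice of Fisher's test is an assumption (other tests of independence are also used), the function name is hypothetical, and the counts are illustrative only.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Sums the probabilities of all tables with the same margins that are
    no more likely than the observed table.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    lo, hi = max(0, col1 - row2), min(row1, col1)
    p_obs = prob(a)
    total = sum(prob(x) for x in range(lo, hi + 1)
                if prob(x) <= p_obs * (1 + 1e-9))
    return min(1.0, total)

# Illustrative counts. Rows: fixed differences, polymorphisms;
# columns: nonsynonymous, synonymous.
fixed_nonsyn, fixed_syn = 7, 17
poly_nonsyn, poly_syn = 2, 42
p_value = fisher_exact_two_sided(fixed_nonsyn, fixed_syn, poly_nonsyn, poly_syn)
print(f"P = {p_value:.4f}")
```

An excess of nonsynonymous fixed differences relative to nonsynonymous polymorphisms, as in these counts, is the pattern usually read as evidence of adaptive fixation.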
For other tests and software to perform these statistics,
see DNA Sequence Polymorphism, DnaSP. See also: Statistical
Tests of Neutrality (Lecture Note by P Beerli); Statistical Tests of Neutrality of Mutations against Excess
of Recent Mutations (Rare Alleles); Statistical Tests of Neutrality of Mutations against an
Excess of Old Mutations or a Reduction of Young Mutations;
Estimation
of theta;
Innan & Tajima, 2002; Properties of statistical tests of neutrality for DNA
polymorphism data, Simonsen, 1995. Review of statistical tests of selective neutrality
on genomic data, Nielsen, 2001 and a Lecture Note by Gil McVean. See also a review by SP Otto (Detecting
the Form of Selection from DNA Sequence Data. TIG 2000).
Linkage disequilibrium (LD)
The
tendency for two 'alleles' to be present on the same chromosome (positive LD),
or not to segregate together (negative LD). As a result, specific alleles at
two different loci are found together more or less often than expected by chance. LD
is the nonindependence, at a population level, of the alleles carried at
different positions in the genome. In the absence of LD, the expected frequency of a
two-locus haplotype can be calculated as the probability of the joint occurrence of
two independent events, simply by multiplying their gene frequencies.
The same situation may exist for more than two alleles. The magnitude of LD is
expressed as the delta (D) value and corresponds to the difference between the observed and the
expected haplotype frequency. If there is no LD, D will be zero (or not significantly
different from zero); if there is positive LD, it will be a positive value. It
can also be negative if the two alleles tend not to occur together. The
statistical significance of LD, which depends on the sample size, and the
magnitude of LD are separate issues. The statistical significance is usually
determined by Fisher's exact test and the magnitude is determined by either the D value or alternative
measures. The magnitude can be normalized (for allele frequencies) to have the
same range of values for any frequency.
Ideally,
the haplotype frequencies should be calculated from family typing data.
Obviously, this gives the most accurate results. In practice, however, when
family data are not available, D and two-locus haplotype frequencies are
calculated from a sample of the population data by constructing 2x2 contingency
tables for each allele pair. A contingency table for this purpose contains the
individual (observed) values cross-classified by levels in two different
attributes. A common 2x2 table constructed in genetic studies is as follows:
| allele j \ allele i | Present (+) | Absent (-) | Row totals |
| Present (+) | a (+/+) | b (+/-) | a+b |
| Absent (-) | c (-/+) | d (-/-) | c+d |
| Column totals | a+c | b+d | N=a+b+c+d |
Counts
for each combination of levels (presence or absence) of the two factors
(alleles) are placed in each cell. The corresponding $D_{ij}$ is estimated by the formula (usually in HLA studies):
$$D_{ij} = \sqrt{\frac{d}{N}} - \sqrt{\frac{b+d}{N} \cdot \frac{c+d}{N}}$$
(originally described by Bodmer & Bodmer, 1970; see also Schipper, 1998)
The
haplotype frequency ($HF_{ij}$) equals $GF_i \times GF_j + D_{ij}$, where GF is the gene frequency (the proportion of the chromosomes
carrying a particular allele). The haplotype frequency calculated with this
formula from the population data compares reasonably well with the estimates
obtained directly from counting haplotypes constructed from family segregation
data. This method generates a reliable estimate of a haplotype frequency with
the exception of very small haplotype frequencies. Also for other parts of the
genome, it has been reported that there is little or no advantage to
constructing haplotypes from family data rather than unrelated individuals. The
major point is that when using population data, genotyping errors become an
issue. When genotyping a large number of markers, an error rate of only 1% will
produce a large number of inaccurate haplotypes. Genotyping errors are not the
only possible sources of accuracy problems. Other factors include sample size,
allele frequency distributions and departures from HWE.
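The 2x2 table calculation can be sketched as follows. The formula for D_ij is the one given above; estimating each gene frequency as one minus the square root of the proportion of individuals lacking the allele is an added assumption (it presumes HWE and is not stated in the text), and the function name and counts are hypothetical.

```python
import math

def two_locus_from_table(a, b, c, d):
    """Estimate D_ij and the haplotype frequency HF_ij from a 2x2 table.

    a: both alleles present, b: only allele j present,
    c: only allele i present, d: neither allele present.
    """
    n = a + b + c + d
    d_ij = math.sqrt(d / n) - math.sqrt(((b + d) / n) * ((c + d) / n))
    # Assumption: gene frequency = 1 - sqrt(fraction lacking the allele),
    # which holds under Hardy-Weinberg proportions.
    gf_i = 1.0 - math.sqrt((b + d) / n)    # allele i is absent in b + d
    gf_j = 1.0 - math.sqrt((c + d) / n)    # allele j is absent in c + d
    hf_ij = gf_i * gf_j + d_ij
    return d_ij, gf_i, gf_j, hf_ij

# Hypothetical counts for 100 individuals.
d_ij, gf_i, gf_j, hf_ij = two_locus_from_table(23, 28, 13, 36)
print(f"D = {d_ij:.3f}, GF_i = {gf_i:.2f}, GF_j = {gf_j:.2f}, HF = {hf_ij:.3f}")
```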
There
are other measures of LD. Because the value of D depends on allele
frequencies, a normalization of D is needed. This is achieved by taking into account the allele
frequencies: normalized delta value $D' = D_{AB} / D_{max}$.
$D_{max}$ is the lesser of $p_A p_b$
or $p_a p_B$ if D is
positive, or of $p_A p_B$ or $p_a p_b$ if D is negative. Because the sign is arbitrary, |D'| is often used rather
than D'. Therefore, D'
(normalized LD) is scaled to remove allele frequency effects. In a large enough
sample, D' = 1 indicates complete LD and D' = 0 corresponds to no LD. |D'| is
directly related to recombination fraction and its generalization to more than
two loci is the only measure of LD not sensitive to allele frequencies. HAPLOVIEW and MIDAS are some of
the software that calculate
D' values.
Another statistic of linkage
disequilibrium is the square of the correlation coefficient ($r^2$) between the alleles at locus A and B: $r^2 = D^2 / (p_A p_a p_B p_b)$, which can also be expressed as $r^2 = D^2 / (p_A (1-p_A)\, p_B (1-p_B))$ (for two loci with two alleles each). The measure $r^2$ has several properties that make it more useful (Pritchard, 2001; Weiss, 2002; Carlson, 2004).
In brief, for low allele frequencies $r^2$ has more reliable sample
properties than |D'|. The allelic
association metric ρ (rho) has the
strongest population theory basis and is the least sensitive to marker allele frequencies
(Morton, 2001); it also has several statistically optimal properties
such as consistency, asymptotic unbiasedness and asymptotic efficiency (Shete, 2003). This parameter can be
estimated on MIDAS
for SNP data (along with other measures of LD). (See
CIGMR; GENESTAT LD Measures; CGIL Summer Course Notes and Measures of LD by Hedrick,
1987; Lewontin, 1988; Devlin & Risch, 1995; Devlin, 1996; Morton,
2001; Wall & Pritchard, 2003; Shete, 2003; Jorde, 2003; Mueller, 2004; Zaykin, 2004; Wang, 2005 and
a Lecture Note at UCL for further details.)
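The three measures D, D' and r² can be computed together from the four haplotype frequencies of a pair of biallelic loci. The sketch below follows the definitions given above; the function name `ld_measures` and the example frequencies are hypothetical, and phased (haplotype) frequencies are assumed to be available.

```python
def ld_measures(p_AB, p_Ab, p_aB, p_ab):
    """Return (D, D', r^2) from two-locus haplotype frequencies."""
    p_A = p_AB + p_Ab
    p_B = p_AB + p_aB
    p_a, p_b = 1.0 - p_A, 1.0 - p_B
    d = p_AB - p_A * p_B                 # observed minus expected
    if d >= 0:
        d_max = min(p_A * p_b, p_a * p_B)
    else:
        d_max = min(p_A * p_B, p_a * p_b)
    # D' keeps the sign of D here; take abs() for |D'|.
    d_prime = d / d_max if d_max > 0 else 0.0
    denom = p_A * p_a * p_B * p_b
    r2 = d * d / denom if denom > 0 else 0.0
    return d, d_prime, r2

print(ld_measures(0.4, 0.1, 0.1, 0.4))   # partial LD
print(ld_measures(0.5, 0.0, 0.0, 0.5))   # complete LD
```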
Statistical Methods for LD
Estimation: While the Mattiuz formula can be used to calculate two-locus haplotype frequencies manually, an
alternative method, maximum likelihood estimation (achieved by the EM 'expectation-maximization' algorithm), can be used if computing facilities are available.
This method was originally described by Yasuda and Tsuji (1975), and compared with other methods by
Schipper et al (1998)
and Excoffier et al. (1995). It can also be used for
multiple-locus haplotypes (Long, 1995). ARLEQUIN, one of the most
sophisticated population genetic data analysis packages, as well as EMLD use
the EM algorithm to calculate multilocus LD. Other software packages that perform
LD analysis are: MIDAS, HAPLOVIEW, Genetic Data Analysis (GDA), EH,
MLD,
DISEQ and PopGene. The proprietary LD software
HelixTree
(manual) computes multilocus haplotype
probabilities using the composite haplotype method (CHM) and the Expectation
Maximization (EM) algorithm for SNP data. LD analysis can also be performed online (Online LD Analysis; Genotype2LDBlock; VG2). All of the above (pairwise) LD
measures can be estimated from unphased SNP data on STATA using David Clayton's
program pwld within genassoc.pkg. For HLA data, HWE and LD (as well as Ewens-Watterson
test) can be assessed using PyPop.
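The EM approach can be sketched for the simplest case of two biallelic loci typed in unrelated individuals, where only double heterozygotes have ambiguous phase. This is a minimal illustration and not the implementation used by any of the packages named above; the function name, the input layout (a 3x3 table of genotype counts) and the example counts are hypothetical. As noted earlier, the expectation step relies on HWE.

```python
def em_haplotype_frequencies(counts, iterations=200, tol=1e-12):
    """EM estimate of two-locus haplotype frequencies.

    counts[i][j] = number of individuals carrying i copies of allele A
    at the first locus and j copies of allele B at the second (i, j in 0..2).
    """
    n_chrom = 2.0 * sum(sum(row) for row in counts)
    # Haplotype copies from genotypes whose phase is unambiguous.
    base = {
        "AB": 2 * counts[2][2] + counts[2][1] + counts[1][2],
        "Ab": 2 * counts[2][0] + counts[2][1] + counts[1][0],
        "aB": 2 * counts[0][2] + counts[1][2] + counts[0][1],
        "ab": 2 * counts[0][0] + counts[1][0] + counts[0][1],
    }
    n_dh = counts[1][1]                      # double heterozygotes
    freq = dict.fromkeys(base, 0.25)         # start at equal frequencies
    for _ in range(iterations):
        # E-step: expected share of double heterozygotes that are AB/ab.
        cis = freq["AB"] * freq["ab"]
        trans = freq["Ab"] * freq["aB"]
        x = cis / (cis + trans) if cis + trans > 0 else 0.5
        # M-step: re-estimate frequencies by counting haplotypes.
        new = {
            "AB": (base["AB"] + x * n_dh) / n_chrom,
            "Ab": (base["Ab"] + (1 - x) * n_dh) / n_chrom,
            "aB": (base["aB"] + (1 - x) * n_dh) / n_chrom,
            "ab": (base["ab"] + x * n_dh) / n_chrom,
        }
        change = max(abs(new[h] - freq[h]) for h in freq)
        freq = new
        if change < tol:
            break
    return freq

# Hypothetical genotype counts for 1000 individuals.
counts = [[160, 80, 10],
          [80, 340, 80],
          [10, 80, 160]]
print(em_haplotype_frequencies(counts))
```

The likelihood can have more than one local maximum, so in practice several starting points are usually tried.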
Interpretation
of LD Data: The patterns of LD observed in natural
populations are the result of a complex interplay between genetic factors and
the population's demographic history (Pritchard, 2001). LD is usually a
function of distance between the two loci. This is mainly because recombination
acts to break down LD in successive generations (Hill, 1966). When a mutation first
occurs it is in complete LD with the nearest marker (D' = 1.0). Given enough
time and as a function of the distance between the mutation and the marker, LD
tends to decay and, at complete equilibrium, reaches D' = 0. Thus, it
decreases at every generation of random mating unless some process opposes
the approach to linkage 'equilibrium'. However, physical distance could
account for less than 50% of the observed variation in LD. One genetic
phenomenon that affects LD is gene conversion. Gene conversion is an important
mechanism in the breaking down of allelic associations over short distances,
i.e., decay of LD. Other factors that influence LD include changes in
population demographics (such as population growth, bottlenecks, geographical
subdivision, admixture and migration) and selective forces. Admixture (intermixture
of populations) would cause LD if the mixing populations have different allele
frequencies. LD will also be erased faster in large populations than in small
ones (chance events in small populations maintain LD). Permanent LD may result from
natural selection if some gametic combinations confer higher fitness than other
combinations. An extraordinary example of the effect of recombination rates on
LD is the discrepancy between genetic distance and physical distance between
HFE and HLA-A, which generates strong LD despite 5Mb distance (Malfroy,
1997). Regional LD may also be variable according to
haplotype. An example has been presented for HLA haplotypes. Haplotype-specific
patterns of LD (Ahmad, 2003) may reflect
haplotype-specific recombination hotspots as has been shown for mouse MHC.
Note
that LD has nothing to do with HWE and should not be confused with it (see Possible
Misunderstandings in Genetics).
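The decay of LD through recombination can be made concrete with the standard deterministic expectation D_t = D_0 (1 - c)^t, where c is the recombination fraction between the two loci. This formula is not given in the text; it is the usual textbook result for a large random-mating population without selection, and the function names below are hypothetical.

```python
import math

def ld_after(d0, recombination_fraction, generations):
    """Expected D after t generations: D_t = D_0 * (1 - c)^t."""
    return d0 * (1.0 - recombination_fraction) ** generations

def generations_to_halve(recombination_fraction):
    """Generations needed for D to fall to half its initial value."""
    return math.log(0.5) / math.log(1.0 - recombination_fraction)

for c in (0.5, 0.01, 0.001):   # unlinked, 1% and 0.1% recombination
    print(f"c = {c}: D halves in about {generations_to_halve(c):.0f} generations")
```

Unlinked loci lose half of their LD every generation, whereas tightly linked loci retain it for hundreds of generations, which is why LD is usually a function of distance.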
Genetic distance (GD)
Genetic distance is a measurement of
genetic relatedness of samples of populations (whereas genetic diversity
represents diversity within a population). The estimate is based on the number
of allelic substitutions per locus that have occurred during the separate
evolution of two populations. (See lecture notes on Genetic Distances, Estimating
Genetic Distance; and GeneDist: Online
Calculator of Genetic Distance.) The software Arlequin v3.01, PHYLIP, GDA, PopGene, Populations and SGS are suitable to calculate population-to-population
genetic distances from allele frequencies (Microsat is a microsatellite distance program).
Genetic
distance can be computed on the freeware PHYLIP. Most components of PHYLIP are
available on the web (or on Pasteur Webserver). One component of
the package, GENDIST, estimates genetic distance from allele
frequencies using one of three methods: Nei's, Cavalli-Sforza's or
Reynolds' (see papers by Cavalli-Sforza & Edwards, 1967, Nei, 1983, Nei, 1996 and lecture note (1) and (2) for more information on these methods).
GENDIST can be run online using default options (Nei's genetic distance) to obtain genetic
distance matrix data. The PHYLIP program CONTML estimates
phylogenies from gene frequency data by maximum likelihood under a model in
which all divergence is due to genetic drift in the absence of new mutations (Cavalli-Sforza's method) and draws a tree. The
program is also available on the web and runs with default options. If
new mutations are contributing to allele frequency changes, Nei's method should
be selected on GENDIST to estimate genetic distances first. Then a tree can be
obtained using one of the following components of PHYLIP. NEIGHBOR draws a phylogenetic tree using
the genetic distance matrix data (from GENDIST). It uses either Saitou and Nei's (1987) Neighbor Joining (NJ) method or the UPGMA (unweighted pair group method with
arithmetic mean; average linkage clustering) method (Sneath & Sokal, 1973). Neighbor Joining is a distance matrix method
producing an unrooted tree without the assumption of a clock (the evolutionary
rate does not have to be the same in all lineages). The major assumption of UPGMA
is an equal rate of evolution along
all branches (which is frequently unrealistic). NEIGHBOR
can be run online. Other components of PHYLIP that draw
phylogenetic trees from genetic distance matrix data are FITCH / online (Fitch-Margoliash method with no assumption of equal evolutionary rate) and KITSCH / online (employs Fitch-Margoliash and Least
Squares methods with the assumption that all tip species are contemporaneous,
and that there is an evolutionary clock, in effect a molecular clock). See also Mathematical Formulae of Various Genetic Distance Measures and
Genetic Distance Equations. Another
freeware, PopGene, calculates Nei's genetic distance and creates a tree from genotypes using the UPGMA method.
For genetic distance calculation on Excel, try the freeware GenAlEx
by Peakall & Smouse.
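Two of the distance measures named in this section can be computed directly from allele frequencies. The sketch below uses the commonly cited forms of Nei's standard genetic distance, D = -ln(J_xy / sqrt(J_x J_y)), and Nei's D_A distance, D_A = 1 - (1/L) Σ_loci Σ_alleles sqrt(x y); neither formula is spelled out in the text, no sample-size correction is applied, and the function names and example frequencies are hypothetical. It is not a substitute for GENDIST or DISPAN.

```python
import math

def nei_standard_distance(pop_x, pop_y):
    """Nei's standard genetic distance (no sample-size correction).

    pop_x, pop_y: lists of loci; each locus is a list of allele
    frequencies given in the same allele order in both populations.
    """
    n_loci = len(pop_x)
    j_x = sum(sum(f * f for f in locus) for locus in pop_x) / n_loci
    j_y = sum(sum(f * f for f in locus) for locus in pop_y) / n_loci
    j_xy = sum(sum(fx * fy for fx, fy in zip(lx, ly))
               for lx, ly in zip(pop_x, pop_y)) / n_loci
    # Undefined (infinite) if the populations share no alleles at all.
    return -math.log(j_xy / math.sqrt(j_x * j_y))

def nei_da_distance(pop_x, pop_y):
    """Nei's D_A distance, averaged over loci."""
    n_loci = len(pop_x)
    shared = sum(sum(math.sqrt(fx * fy) for fx, fy in zip(lx, ly))
                 for lx, ly in zip(pop_x, pop_y))
    return 1.0 - shared / n_loci

# Hypothetical allele frequencies at two loci in two populations.
pop_1 = [[0.5, 0.5], [0.2, 0.3, 0.5]]
pop_2 = [[0.9, 0.1], [0.4, 0.4, 0.2]]
print(nei_standard_distance(pop_1, pop_2), nei_da_distance(pop_1, pop_2))
```

A matrix of such pairwise distances is the input expected by tree-building methods such as NJ and UPGMA.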
Because of the different
assumptions they are based on, the NJ and UPGMA methods may construct
dendrograms with totally different topologies. For an example of this and a
review of the main differences between the two methods, see Nei & Roychoudhury, 1993 (PDF). Both methods use distance
matrices (the Fitch-Margoliash and Minimum Evolution methods are also distance
methods). The principal difference between NJ and UPGMA is that NJ does not
assume an equal evolutionary rate for each lineage. Since the constant rate of
evolution does not hold for human populations, NJ seems to be the better
method. For the genetic loci subject to natural selection, the evolutionary
rate is not the same for each population and therefore UPGMA should be avoided
for the analysis of such loci (including the HLA genes). The leading group in
HLA-based genetic distance analysis led by Arnaiz-Villena proposes that the most
appropriate genetic distance measure for the HLA system is the DA value first
described by Nei, 1983.
Unlike UPGMA, NJ produces an unrooted tree. To find the root of the tree, one
can add an outgroup. The point in the tree where the edge to the outgroup joins
is the best possible estimate for the root position. One persistent problem
with tree construction is the lack of statistical assessment of the
phylogenetic tree presented. This is best done with widely available bootstrap
analysis originally described in Felsenstein J: Evolution 1985;39:783-791
(available through JSTOR if
you have access) and Efron, 1996; and reviewed in (Nei, 1996). For a discussion of
statistical tests of molecular phylogenies, see Li & Gouy, 1990 and Nei, 1996. For the topology to be
statistically significant the bootstrap value for each cluster should reach at
least 70% whereas 50% overestimates accuracy of the tree. Bootstrap tests
should be done with at least 1000 (preferably more) replications.
Nei noted that some genes are more suitable than others for phylogenetic
inference and that most tree-building methods tend to produce the same topology
whether the topology is correct or not (Nei M, 1996). He also added that
sometimes adding one more species/population changes the whole tree for unknown
reasons. An example of this has been provided in a study of human populations
with genetic distances (Nei & Roychoudhury, 1993). The properties of the most
popular genetic distance measures have been reviewed (Kalinowski, 2002).
Whichever measure is used, large sample sizes are required when populations are
relatively genetically similar, and loci with more alleles produce better
estimates of genetic distance. The number of loci also matters: in a simulation
study, Nei et al concluded that more than 30 loci should be used for making
phylogenetic trees (Nei, 1983).
There seems to be a consensus that estimated trees are nearly always erroneous
(i.e., the topological arrangement will be wrong) if the number of loci is less
than 30 (Nei, 1996;
Jorde LB. Human genetic distance studies. Ann Rev Anthropol
1985;14:343-73; available through JSTOR if you have access). If populations are closely
related, even 100 loci may be necessary for an accurate estimation of the
relationships by genetic distance methods. Cavalli-Sforza et al have noted
important correlations between genetic trees and linguistic evolutionary
trees, with exceptions for New Guinea, Australia and South America (Cavalli-Sforza, 1994).
Especially for the HLA genes, phylogenetic trees can be constructed by
using Nei's DA genetic distance values and the NJ method with bootstrap tests
in DISPAN. Correspondence analysis, a supplementary analysis to genetic distances and dendrograms,
displays a global view of the relationships among populations (Greenacre, 1984; Greenacre & Blasius, 1994; Blasius & Greenacre, 1998). This
type of analysis tends to give results similar to those of dendrograms, as
expected from theory (Cavalli-Sforza & Piazza, 1975),
and is more informative and accurate than dendrograms, especially when there is
considerable genetic exchange between close geographic neighbors (Cavalli-Sforza, 1994). In their
enormous effort to work out the genetic relationships among human populations,
Cavalli-Sforza et al concluded that two-dimensional scatter plots obtained by
correspondence analysis frequently resemble geographic maps of the populations,
with some distortions (Cavalli-Sforza, 1994). Using the same
allele frequencies that are used in phylogenetic tree construction,
correspondence analysis can be performed in ViSta (v7.0), VST or SAS, but most
conveniently in the Multi Variate Statistical Package (MVSP).
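For readers without access to these packages, the computation behind a
two-dimensional correspondence analysis plot can be sketched with NumPy. This
is a minimal sketch of the standard singular value decomposition approach, not
the routine used by any of the packages named above; the function name and the
table of allele frequencies (rows are populations, columns are alleles) are
hypothetical. Every row and column of the table must have a positive total.

```python
import numpy as np


def correspondence_analysis(table, n_axes=2):
    """Principal coordinates of rows and columns of a non-negative table."""
    counts = np.asarray(table, dtype=float)
    P = counts / counts.sum()            # correspondence matrix
    r = P.sum(axis=1)                    # row masses (populations)
    c = P.sum(axis=0)                    # column masses (alleles)
    expected = np.outer(r, c)            # independence model
    S = (P - expected) / np.sqrt(expected)   # standardized residuals
    U, sv, Vt = np.linalg.svd(S, full_matrices=False)
    row_coords = (U * sv) / np.sqrt(r)[:, None]
    col_coords = (Vt.T * sv) / np.sqrt(c)[:, None]
    inertia = sv ** 2
    explained = inertia[:n_axes] / inertia.sum()
    return row_coords[:, :n_axes], col_coords[:, :n_axes], explained


# Hypothetical allele frequencies: 4 populations x 5 alleles.
freqs = np.array([
    [0.30, 0.25, 0.20, 0.15, 0.10],
    [0.28, 0.27, 0.18, 0.17, 0.10],
    [0.10, 0.15, 0.20, 0.25, 0.30],
    [0.12, 0.13, 0.22, 0.23, 0.30],
])
rows, cols, explained = correspondence_analysis(freqs)
for name, (x, y) in zip(["Pop1", "Pop2", "Pop3", "Pop4"], rows):
    print(f"{name}: axis1 = {x:+.3f}, axis2 = {y:+.3f}")
print("share of inertia on the first two axes:", np.round(explained, 3))
```

Plotting the two columns of `rows` against each other gives the kind of scatter
plot described above, in which populations with similar allele frequencies fall
close together.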
Link to a Tutorial on Correspondence Analysis.
Spreadsheet Exercises in Ecology & Evolution & Spreadsheet Exercises in Conservation Biology & Landscape Ecology (including HWE and LD)
History of Population Genetics and Evolution in A History of Genetics by AH Sturtevant
Introduction to Population Genetics @ NBII
ASHI 2001 Biostatistics and Population Genetics Workshop Notes
Microsatellites and Genetic Distance (Primer on Genetic Distance)
HWE in Kimball's Biology Pages Online
Biology Book: Genetics - HWE
Online HWE Test
Online GD Calculation
Genetic Power Calculator
Simulations: Population Biology, Population Genetics, Evolutionary Biology
Human Genetics for the Social Sciences Interactive Learning Exercises
Lectures on Population Genetics (1) & (2) & (3) & (4) & (5) & (6)
Population Genetics Glossary
Population Genetics Course by Paula Nuin at McMaster
Population Genetics Course by Knud Christensen (PDF of whole course)
Population Genetics Course by Kent E Holsinger
Population Genetics (HWE) Notes by Robert J Huskey
Evolutionary Quantitative Genetics Notes by Bruce Walsh
Human Genome Epidemiology Online Book
Medical Applications of Population Genetics
Genetic Epidemiology Notes
Genetic Epidemiology Glossary
Software for Population Genetic Analyses
Freeware Population Genetic Data Analysis Software (List of Features):
Arlequin v3.1 (2005), PopGene, GDA, Genetix, GenePop, GeneStrut, SGS, EMLD, LDA, MIDAS, PyPop (HLA & HWE/LD)
Gene[VA] Tools for Genetic Data Analysis (AGP, Universite de Geneve, Switzerland):
HLA Data Analysis (HWE, LD, Association) (EFI2006 Teaching Session)
Linkage Disequilibrium Software (Oxford, UK) including GOLD
Extended Haplotype Homozygosity (EHH) Web-Tool (ref)
HAPLOTYPER, Genotype2LDBlock, DISEQ, PHASE v2.0, UNPHASED, HAPLOVIEW (Tutorial)
HWE
PHYLIP (BioPortal), PHYLIP (Pasteur), PHYLIP 3.62 (Download), DISPAN, ViSta, GenAlEx, CLUMP, TDT
Genetic Calculation Applets by Knud Christensen
Software at Dyer Laboratory for Population Genetics (including AMOVA)
MSA (for microsatellite data), POPULATIONS, WINPOP v2.0, QUANTO
Computational Genetics Software including EHAP
GSF: Genetic Software Forum
Partition for Online Bayesian Analysis
Comprehensive List of Genetic Analysis Software (1) (2) (3)
Review of common Population Genetics software by Labate JA, 2000
Population Genetics Software Course at UCDAVIS Genomic Variation Laboratory
Linkage Disequilibrium Analysis Bibliography
Address For bookmark: http://www.dorak.info/genetics/popgen.html
M.Tevfik Dorak, MD, PhD
Last updated 3 May 2007