BASIC POPULATION GENETICS
M.Tevfik Dorak, M.D., Ph.D.
G.H. Hardy (the English mathematician) and W. Weinberg (the German
physician) independently worked out the mathematical basis of population
genetics in 1908 (Hardy, 1908). Their formula predicts
the expected genotype frequencies using the allele frequencies in a diploid
Mendelian population. They were concerned with questions like "what
happens to the frequencies of alleles in a population over time?" and
"would you expect to see alleles disappear or become more frequent over time?"
Hardy and Weinberg showed in the following manner that if the population
is very large and random mating is taking place, allele frequencies remain
unchanged (or in equilibrium) over time unless some other factors intervene. If
the frequencies of alleles A and a (of a biallelic locus) are p and q, then p + q = 1. This means $(p + q)^2 = 1$ too. It is also correct that $(p + q)^2 = p^2 + 2pq + q^2 = 1$. In this formula, $p^2$ corresponds to the frequency of the homozygous genotype AA, $q^2$ to aa, and $2pq$ to Aa. Since AA, Aa and aa are the three possible genotypes for a biallelic locus, the sum of their frequencies should be 1. In summary, the Hardy-Weinberg formula shows that:
$$\underbrace{p^2}_{AA} + \underbrace{2pq}_{Aa} + \underbrace{q^2}_{aa} = 1$$
If the observed frequencies do not show a significant difference from
these expected frequencies, the population is said to be in Hardy-Weinberg
equilibrium (HWE). If they do, at least one of the assumptions of the formula
(listed below) is violated, and the population is not in HWE. Of course, the three observed
genotype frequencies will always sum to 1.0, but to check HWE, one starts
with p calculated from the observed homozygote genotype frequency, calculates
q as (1 - p), and estimates the expected heterozygote frequency ($2pq$) and the expected homozygote
frequency for the other allele ($q^2$). If these frequencies are not
similar to the observed values, the sample may not be in HWE. For formal testing,
try one of the online tools (Online HWE
and Association Testing (SNP); Online HWE Test (Multiallelic Markers);
OEGE HWE Calculator; Genetic Calculation Applets (up to four alleles);
Haploview (SNP)) or freeware (Arlequin v3.01; PopGene; GDA; TFPGA; PyPop
(HLA & HWE/LD)). Stata programs are
available to test HWE (Genetic Epidemiology Programs for Stata)
and Excel can be used too. One important point is to choose an exact test for
multiallelic markers because Chi-squared tests are inappropriate when there are
multiple alleles (Guo & Thompson, 1992). It has been
argued that even for large samples the Chi-squared test is inappropriate and exact
tests should be used in the assessment of HWE (Wigginton, 2005). In any case, the HWE test has
low statistical power to identify genotyping errors (Leal, 2005).
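The comparison of observed and expected genotype counts can be sketched in a few lines of Python. This is only a minimal illustration, not the method of any of the tools listed above: it assumes a biallelic locus, estimates p by gene counting (an assumption of this sketch), and uses the Chi-squared approximation with one degree of freedom, although an exact test may be preferable as noted above. The function name `hwe_chi_square` and the genotype counts are hypothetical.

```python
import math

def hwe_chi_square(n_AA, n_Aa, n_aa):
    """Chi-squared goodness-of-fit test for HWE at a biallelic locus.

    Assumes both alleles are present in the sample (0 < p < 1).
    """
    n = n_AA + n_Aa + n_aa
    # Allele frequency by gene counting (an assumption of this sketch).
    p = (2 * n_AA + n_Aa) / (2 * n)
    q = 1 - p
    observed = (n_AA, n_Aa, n_aa)
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # One degree of freedom: 3 genotype classes - 1 - 1 estimated allele frequency.
    # For 1 df the upper-tail probability equals erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return p, expected, chi2, p_value

# Hypothetical genotype counts (AA, Aa, aa), for illustration only.
p, expected, chi2, p_value = hwe_chi_square(298, 489, 213)
print(f"p = {p:.4f}, chi2 = {chi2:.3f}, P = {p_value:.3f}")
```

A large P value here only means that no deviation was detected; it does not show that all assumptions of HWE hold.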
The assumptions of HWE
1. Population size is effectively infinite,
2. Mating is random in the population (the most common deviation results
from inbreeding),
3. Males and females have similar allele frequencies, and the locus is
autosomal,
4. There are no mutations and migrations affecting the allele
frequencies in the population,
5. The genotypes have equal fitness, i.e., there is no selection (in
viability and fertility).
The Hardy-Weinberg law suggests that as long as the assumptions are
valid, allele and genotype frequencies will not change in a population in
successive generations. Thus, any deviation from HWE may indicate the
following biological processes:
1. Small population size results in random sampling errors and
unpredictable genotype frequencies (a real population's size is always finite
and the frequency of an allele may fluctuate from generation to generation due
to chance events),
2. Assortative mating which may be positive (increases homozygosity;
self-fertilization is an extreme example) or negative (increases
heterozygosity), or inbreeding which increases homozygosity in the whole genome
without changing the allele frequencies. Rare-male mating advantage also tends
to increase the frequency of the rare allele and heterozygosity for it (in
reality, random mating does not occur all the time). Cryptic population
stratification is another reason for departure from HWE.
3. A very high mutation rate in the population (typical mutation rates
are $< 10^{-5}$ per generation) or massive migration from a
genotypically different population interfering with the allele frequencies,
4. Selection of one or a combination of genotypes (selection may be negative
or positive). Selective elimination of homozygotes as in some autosomal
dominant diseases, where homozygotes for the mutation may die in utero, is an
example (in a very large sample, this could produce a detectable violation of HWE). Similarly,
sampling error (selection bias) may also affect HWE if the bias
is related to ethnicity.
5. Unequal transmission ratio (transmission ratio distortion or
segregation distortion) of alternative alleles from parents to offspring (as in
mouse t-haplotypes),
6. Differential gene frequency among males and females.
In most population genetic estimations (like linkage disequilibrium
calculations), HWE is assumed. This means that genotype probabilities are
determined by allele frequencies and nothing interferes with that, i.e., there is
no transmission ratio distortion, selection against a genotype (lethality) etc.
If this assumption is not met, the estimations will not be accurate: statistical methods using
allele frequencies may not be valid and methods that use genotype frequencies
should be preferred (Xu, 2002) (discussed below). See the HWE
simulation in EvoTutor for the effects of selection, mutation,
migration, genetic drift and assortative mating.
The implications of the HWE
1. The allele frequencies remain constant from generation to generation.
This means that the hereditary mechanism itself does not change allele frequencies.
It is possible for one or more assumptions of the equilibrium to be violated
and still not produce deviations from the expected frequencies that are large
enough to be detected by the goodness of fit test,
2. When an allele is rare, there are many more heterozygotes than
homozygotes for it. Thus, rare alleles will be impossible to eliminate even if
there is selection against homozygosity for them,
3. For populations in HWE, the proportion of heterozygotes is maximal
when allele frequencies are equal (p = q = 0.50), and when this happens the
heterozygote frequency will be 0.50 (2 × 0.50 × 0.50). Unless HWE is violated (as
in selective loss of homozygotes), heterozygosity can never be more than 0.50
at any biallelic locus (see the Figure in Prof Whitlock's Review).
4. An application of HWE is that when the frequency of an autosomal
recessive disease (e.g., sickle cell disease, hereditary hemochromatosis,
congenital adrenal hyperplasia) is known in a population, and unless there is
reason to believe HWE does not hold in that population, the gene frequency of
the disease gene can be calculated. See also Clinical Genetics. Likewise, the
carrier rate may be calculated for autosomal recessive disorders if the disease
gene frequency is known. For example, phenylketonuria (PKU) occurs in 1/11,000 ($q^2$), which gives
$q \approx 0.0095$ and a heterozygote carrier frequency of approximately 1/50 [$2pq = 2q(1-q)$]. If the diseased
individuals ($q^2$) are excluded from the whole population, the carrier rate among unaffected individuals
is $2q(1-q)/(1-q^2) = 2q/(1+q)$.
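The PKU arithmetic can be checked with a short Python sketch. It assumes HWE holds and that the disease frequency equals the frequency of recessive homozygotes; the function name `carrier_rates` is hypothetical.

```python
import math

def carrier_rates(disease_frequency):
    """Return (q, carrier rate in the whole population, carrier rate among unaffected)."""
    q = math.sqrt(disease_frequency)   # disease frequency = q^2 under HWE
    p = 1 - q
    carriers_all = 2 * p * q           # heterozygotes in the whole population
    carriers_unaffected = 2 * q / (1 + q)  # 2pq / (1 - q^2), simplified
    return q, carriers_all, carriers_unaffected

q, carriers_all, carriers_unaffected = carrier_rates(1 / 11000)
print(f"q = {q:.4f}")
print(f"carriers: about 1 in {1 / carriers_all:.0f}")
print(f"carriers among unaffected: about 1 in {1 / carriers_unaffected:.0f}")
```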
It has to be remembered that when HWE is tested, mathematical thinking
is necessary. When the population is found to be in equilibrium, it does not
necessarily mean that all assumptions are valid since there may be
counterbalancing forces. Similarly, a significant deviation may be due to
sampling errors (including the Wahlund effect, see below and Glossary),
misclassification of genotypes, measuring two or more systems as a single
system, population substructure, failure to detect rare alleles and the
inclusion of non-existent alleles. The Hardy-Weinberg law rarely holds true in
nature (otherwise evolution would not occur). Organisms are subject to
mutations and selective forces, they move about, and allele frequencies may
differ between males and females. Gene frequencies are constantly changing
in a population, but the effects of these processes can be assessed by using
the Hardy-Weinberg law as the starting point.
The direction of departure of observed from expected frequency cannot be
used to infer the type of selection acting on the locus even if it is known
that selection is acting. If selection is operating, the frequency of each
genotype in the next generation will be determined by its relative fitness (W).
Relative fitness is a measure of the relative contribution that a genotype
makes to the next generation. It can be measured in terms of the intensity of
selection (s), where $W = 1 - s$ and $0 \le s \le 1$. The frequencies of the genotypes after selection will be
proportional to $p^2 W_{AA}$, $2pq W_{Aa}$ and $q^2 W_{aa}$. The highest fitness is
always 1 and the others are estimated proportional to this. For example, in the
case of heterozygote advantage (or overdominance), the fitness of the
heterozygous genotype (Aa) is 1, and the fitnesses of the negatively selected
homozygous genotypes are $W_{AA} = 1 - s_{AA}$ and $W_{aa} = 1 - s_{aa}$.
It can be shown mathematically that only in this case a
stable polymorphism is possible. Other selection forms, underdominance and
directional selection, result in unstable polymorphisms. The weighted average
of the fitnesses of all genotypes is the mean fitness. It is important that
genetic fitness is determined by both fertility and viability. This means that
diseases that are fatal to the bearer but do not reduce the number of progeny
are not genetic lethals and do not have reduced fitness (like the adult onset
genetic diseases: Huntington's chorea, hereditary hemochromatosis). The
detection of selection is not easy because the impact on changes in allele
frequency occurs very slowly and selective forces are not static (may even vary
in one generation as in antagonistic pleiotropy).
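The effect of the fitness-weighted genotype frequencies on allele frequency can be illustrated with a small simulation. The sketch below assumes an infinite, randomly mating population with constant fitnesses and uses the standard one-locus recursion (the frequency of A after selection is the fitness-weighted share of A alleles divided by the mean fitness). The function names and fitness values are hypothetical. For heterozygote advantage, the standard result is a stable equilibrium at $p = s_{aa} / (s_{AA} + s_{aa})$, which is a textbook formula rather than one stated above.

```python
def next_allele_frequency(p, w_AA, w_Aa, w_aa):
    """Frequency of allele A after one generation of selection."""
    q = 1 - p
    # Mean fitness: weighted average of the genotype fitnesses.
    mean_w = p * p * w_AA + 2 * p * q * w_Aa + q * q * w_aa
    # AA contributes only A alleles; Aa contributes A in half of its gametes.
    return (p * p * w_AA + p * q * w_Aa) / mean_w

def iterate(p0, w_AA, w_Aa, w_aa, generations):
    """Apply the selection recursion repeatedly, starting from p0."""
    p = p0
    for _ in range(generations):
        p = next_allele_frequency(p, w_AA, w_Aa, w_aa)
    return p

# Heterozygote advantage with hypothetical s_AA = 0.1 and s_aa = 0.3.
for start in (0.1, 0.9):
    print(start, "->", round(iterate(start, 0.9, 1.0, 0.7, 500), 4))
```

Starting from either side, the allele frequency approaches the same interior value, whereas with underdominance (for example fitnesses 1.0, 0.8, 1.0) it moves away from the interior equilibrium towards fixation or loss.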
All discussions presented so far concern a simple biallelic locus. In
real life, however, there are many loci which are multiallelic, and interacting
with each other as well as with the environmental factors. The Hardy-Weinberg
principle is equally applicable to multiallelic loci but the mathematics is
slightly more complicated. For multigenic and multifactorial traits, which are
mathematically continuous as opposed to discrete, more complex techniques of
quantitative genetics are required.
In a final note on the practical use of HWE, it has to be emphasized
that its violation in practice is most
frequently due to genotyping errors (Gomes, 1999;
Lewis, 2002).
Allelic misassignments, as frequently happen when the PCR-SSP method is used, sometimes due to allelic
dropout (increased homozygosity), are the most frequent causes of Hardy-Weinberg
disequilibrium. When this is observed, the genotyping protocol should be
reviewed. Another common reason is the presence of an unknown allele which is
not considered in the genotyping scheme (null allele). This happens when a
variant is considered to be biallelic while it is actually multiallelic. In a
case-control association study, it is of paramount importance that the control
group is in HWE to rule out any technical errors and to avoid false-positive
associations (Tiret, 1995; Schaid & Jacobsen, 1999). The
violation of HWE in the case group, however, may be due to a real association (Nielsen, 1998;
Lee, 2003; Czika & Weir, 2004). When
Hardy-Weinberg disequilibrium is not due to a technical error, the statistical
evaluation of the data should involve statistical tests using genotype
frequencies rather than allele frequencies (Xu, 2002; Lewis, 2002). One such test is the trend test (which
estimates a common odds ratio); it can be run online for SNP data (Online HWE
& Association Testing).
Departure from HWE proportions can bias
haplotype frequency estimates obtained from population data.
It may be a substantial source of error
in Expectation-Maximization haplotype frequency estimation, simply
because the algorithm relies on HWE in its 'expectation'
step (Fallin & Schork, 2000). For reviews of HWE in
genetic epidemiology studies, see Gomes, 1999; Attia,
2003; Leal, 2005; Zou & Donner, 2006.
See Testing for Consistency with HW Proportions
at GENESTAT
Statistical Genetics Pages; PEDSTATS HWE Tutorial; HWE Slide Show.
Some concepts relevant to HWE
Wahlund effect:
Reduction in observed heterozygosity (increased homozygosity) because of
pooling discrete subpopulations with different allele frequencies that do not
interbreed as a single randomly mating unit. When all subpopulations have the
same gene frequencies, no variance among subpopulations exists, and no Wahlund
effect occurs ($F_{ST} = 0$). Isolate breaking is the phenomenon
that the average heterozygosity temporarily increases when discrete
subpopulations make contact and interbreed (this is due to decrease in homozygotes).
It is the opposite of Wahlund effect
(Wahlund Effect & F-Statistics Simulation Applet).
F statistics:
The F statistics in population genetics have nothing to do with the F statistic
used to evaluate differences in variances. Here F stands for fixation index,
fixation being increased homozygosity resulting from inbreeding. Population
subdivision results in the loss of genetic variation (measured by
heterozygosity) within subpopulations because they are small populations and
genetic drift acts within each of them. This means that population
subdivision results in decreased heterozygosity relative to the
heterozygosity expected under random mating if the whole population were a single
breeding unit. Wright developed three fixation indices to evaluate population
subdivision: $F_{IS}$ (interindividual), $F_{ST}$
(subpopulations) and $F_{IT}$ (total population).
$F_{IS}$
is a measure of the deviation of genotypic frequencies from panmictic
frequencies in terms of heterozygote deficiency or excess. It is what is known
as the inbreeding coefficient (f), which is conventionally
defined as the probability that two alleles in an individual are identical by
descent (autozygous). The technical description is the correlation of uniting
gametes relative to gametes drawn at random from within a subpopulation (Individual
within the Subpopulation), averaged over subpopulations. It is calculated
in a single population as $F_{IS} = 1 - (H_{OBS} / H_{EXP})$
[equal to $(H_{EXP} - H_{OBS}) / H_{EXP}$], where $H_{OBS}$ is the observed heterozygosity and $H_{EXP}$
is the expected heterozygosity calculated on the assumption of random mating.
It shows the degree to which heterozygosity is reduced below the expectation.
The value of $F_{IS}$ ranges between -1 and +1. Negative $F_{IS}$
values indicate heterozygote excess (outbreeding) and positive values indicate
heterozygote deficiency (inbreeding) compared with HWE expectations.
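As a minimal sketch, the single-population calculation can be written as follows for a biallelic locus. The function name `f_is` is hypothetical, the allele frequency is obtained by gene counting, and the expected heterozygosity is taken as 2pq without any small-sample correction (both are assumptions of this example).

```python
def f_is(n_AA, n_Aa, n_aa):
    """F_IS = 1 - H_obs / H_exp for one biallelic locus in one population."""
    n = n_AA + n_Aa + n_aa
    p = (2 * n_AA + n_Aa) / (2 * n)   # allele frequency by gene counting
    h_obs = n_Aa / n                  # observed heterozygosity
    h_exp = 2 * p * (1 - p)           # expected under random mating
    return 1 - h_obs / h_exp

print(f_is(25, 50, 25))   # HWE proportions: no deficiency or excess
print(f_is(40, 20, 40))   # heterozygote deficiency gives a positive value
print(f_is(10, 80, 10))   # heterozygote excess gives a negative value
```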
$F_{ST}$
measures the effect of population subdivision, which is the reduction in
heterozygosity in a subpopulation due to genetic drift. $F_{ST}$ is the
most inclusive measure of population substructure and is most useful for
examining the overall genetic divergence among subpopulations. It is also
called the coancestry coefficient (θ) [Weir & Cockerham, 1984] or 'fixation
index' and is defined as the correlation of gametes within subpopulations
relative to gametes drawn at random from the entire population (Subpopulation
within the Total population). It is calculated from the (average) subpopulation
heterozygosity $H_S$ and the total population expected heterozygosity $H_T$ as
$F_{ST} = (H_T - H_S) / H_T$. $F_{ST}$
is never negative; it ranges between 0 = panmixis (no subdivision, random
mating occurring, no genetic divergence within the population) and 1 = complete
isolation (extreme subdivision). $F_{ST}$ values up to 0.05 indicate
negligible genetic differentiation whereas values >0.25 mean very great genetic
differentiation within the population analyzed. $F_{ST}$ is usually
calculated for different genes, then averaged across all loci and all populations.
For human populations, the average value of $F_{ST}$ for a large number
of DNA polymorphisms is 0.139 (and 0.119 for non-DNA polymorphisms) (Cavalli-Sforza, 1994).
Using the $F_{ST}$ values, less differentiation is seen between human
populations within continents than between continents. This is consistent with
simple isolation by distance (the greater variability of populations from
sub-Saharan Africa is the rule). $F_{ST}$ can also be used to estimate
gene flow (the number of migrants per generation): $Nm = 0.25 (1 - F_{ST}) / F_{ST}$. This highly versatile
parameter is even used as a genetic distance measure between two populations
instead of a fixation index among many populations (see Weir BS, Genetic Data Analysis II, 1996 (Genetic Data Analysis III, 2007) and Kalinowski, 2002).
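A minimal Python sketch of these calculations is given below. It uses the standard heterozygosity-based definition $F_{ST} = (H_T - H_S) / H_T$ for one biallelic locus, assumes equally sized subpopulations (an assumption of this example), and applies the gene flow expression quoted above. The function names are hypothetical, and the sketch is not the Weir & Cockerham estimator.

```python
def f_st(subpop_freqs):
    """F_ST from the frequency of one allele in each subpopulation.

    Assumes equally weighted subpopulations and that the locus is
    polymorphic in the pooled population (H_T > 0).
    """
    k = len(subpop_freqs)
    # H_S: average expected heterozygosity within subpopulations.
    h_s = sum(2 * p * (1 - p) for p in subpop_freqs) / k
    # H_T: expected heterozygosity of the pooled population.
    p_bar = sum(subpop_freqs) / k
    h_t = 2 * p_bar * (1 - p_bar)
    return (h_t - h_s) / h_t

def migrants_per_generation(fst):
    """Gene flow estimate 0.25 * (1 - F_ST) / F_ST (requires F_ST > 0)."""
    return 0.25 * (1 - fst) / fst

fst = f_st([0.2, 0.8])
print(f"F_ST = {fst:.3f}, Nm = {migrants_per_generation(fst):.3f}")
```

The same numbers illustrate the Wahlund effect: pooling the two subpopulations gives an expected heterozygosity of 0.50, while the average heterozygosity actually present within them is only 0.32.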
$F_{IT}$
is rarely used. It is the overall inbreeding coefficient (F) of
an individual relative to the total population (Individual within the Total
population).
See also a Lecture Note on Analysis of Molecular
Variance (AMOVA), a method of estimating population differentiation directly
from molecular data (and the Genetic
Epidemiology Glossary).
Detecting Selection Using DNA Polymorphism Data
Several methods have been designed to use DNA
polymorphism data (sequences and allele frequencies) to obtain information on
past selection events. Most commonly, the ratio of
non-synonymous (replacement) to synonymous (silent) substitutions (dN/dS ratio;
see below) is used as evidence for overdominant selection (balancing selection)
of which one form is heterozygote advantage. Classic examples of this are the mammalian MHC
genes and compatibility systems in other organisms: the self-incompatibility
system of plants, fungal mating types and invertebrate allorecognition
systems. In all these genes, a very high number of alleles is also noted. This
can be interpreted as an indicator of some form of balancing (diversifying)
selection. In the case of neutral polymorphism, one common allele and a few
rare alleles are expected. The frequency distribution of alleles is also
informative. A large number of alleles showing a relatively even distribution is
against neutrality expectations and suggestive of diversifying selection.
Most
tests detect selection by rejecting the neutrality assumption (the observed data
deviate significantly from what is expected under neutrality). This deviation,
however, may also be due to other factors such as changes in population size or
genetic drift (see Tripathy & Reddy, 2007 (Table 1) for a review in the context
of G6PD deficiency). The original neutrality test was the Ewens-Watterson homozygosity test of
neutrality (see Glossary),
based on the comparison of the observed homozygosity with the value predicted by Ewens's sampling formula, which uses
the number of alleles and the sample size. This test is not very powerful.
Other
commonly used statistical tests of neutrality are Tajima's D (which compares two estimates of theta) and Fu &
Li's D, D*, F and F*. Tajima's test (Tajima, 1989) is based on the fact
that under the neutral model the estimates of the number of segregating/polymorphic
sites and of the average number of nucleotide differences are correlated. If
the value of D is too large or too small, the neutral 'null' hypothesis is
rejected. DnaSP
calculates D and its confidence limits (two-tailed test). Tajima did not
base this test on the coalescent, but Fu and Li's tests (Fu & Li, 1993) are directly based
on it. The test statistics D and F require data from intraspecific
polymorphism and from an outgroup (a sequence from a related species), whereas D*
and F* only require intraspecific data. DnaSP uses the critical values
obtained by Fu & Li, 1993 to determine the
statistical significance of the D, F, D* and F* test statistics. DnaSP
can also calculate Fu's Fs test statistic (Fu, 1997). The results of this group
of tests (Tajima's D and Fu & Li's tests), which are based on allelic variation and/or
level of variability, may not clearly distinguish between selection and
demographic alternatives (bottleneck, population subdivision), but this problem
only applies to the analysis of a single locus: demographic changes affect all
loci whereas selection is expected to be locus-specific, so the two can be
distinguished if multiple loci are analyzed. Tests for multiple loci include
the HKA test described by Hudson et al (1987). This test is based on the idea
that in the absence of selection, the expected number of polymorphic (segregating)
sites within species and the expected number of 'fixed' differences between
species (divergence) are both proportional to the mutation rate, and their ratio
should be the same for all loci. Variation in the ratio of divergence
to polymorphism among loci suggests selection.
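To make the logic of Tajima's test concrete, the statistic can be computed from three summary values: the number of sequences (n), the number of segregating sites (S) and the average number of pairwise nucleotide differences. The sketch below uses the standard constants from Tajima (1989); the function name and the input values are hypothetical, and it does not reproduce the significance testing or confidence limits that DnaSP provides.

```python
import math

def tajimas_d(n, segregating_sites, mean_pairwise_diff):
    """Tajima's D from sample size, number of segregating sites and mean pairwise differences."""
    s = segregating_sites
    if n < 2 or s < 1:
        raise ValueError("need at least 2 sequences and 1 segregating site")
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n * n + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    # Difference between the two estimates of theta, scaled by its standard deviation.
    return (mean_pairwise_diff - s / a1) / math.sqrt(e1 * s + e2 * s * (s - 1))

# Hypothetical sample: 10 sequences, 16 segregating sites, 3.89 mean pairwise differences.
print(round(tajimas_d(10, 16, 3.888889), 3))
```

A negative value, as in this example, reflects an excess of rare variants relative to neutral expectations, which may be due to selection or to demographic changes as discussed above.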
A
different group of neutrality tests that are not sensitive to demographic
changes includes the McDonald-Kreitman test (McDonald & Kreitman, 1991) and the dN/dS ratio test. The McDonald-Kreitman test
compares the ratio of the number of nonsynonymous to synonymous 'polymorphisms'
within species with the ratio of the number of nonsynonymous to synonymous
'fixed' differences between species in a 2x2 table. The most direct method of
showing the presence of positive selection is to compare the number of
nonsynonymous (dN) to the number of
synonymous (dS) substitutions in a locus.
A high (>1) dN/dS ratio suggests fixation
of nonsynonymous mutations with a higher probability than neutral (synonymous)
ones. Statistical properties of this test are given by Goldman & Yang, 1994 and by Muse & Gaut, 1994. The dN/dS ratio tests take into
account transition/transversion rate bias and codon usage bias.
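The 2x2 comparison in the McDonald-Kreitman test can be sketched as follows. This is only an illustration with hypothetical counts: it summarizes the table with the neutrality index (the ratio of the two nonsynonymous/synonymous ratios, a commonly used summary that is not described above) and evaluates it with a two-sided Fisher's exact test implemented from the hypergeometric distribution. The function names are hypothetical, and all synonymous counts are assumed to be non-zero.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    n = row1 + row2

    def prob(x):
        # Hypergeometric probability of x in the top-left cell given the margins.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables at least as extreme as the observed one.
    return sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))

def mcdonald_kreitman(pn, ps, dn, ds):
    """pn, ps: nonsynonymous and synonymous polymorphisms; dn, ds: fixed differences."""
    neutrality_index = (pn / ps) / (dn / ds)
    return neutrality_index, fisher_exact_two_sided(pn, ps, dn, ds)

# Hypothetical counts, for illustration only.
ni, p_value = mcdonald_kreitman(pn=2, ps=42, dn=7, ds=17)
print(f"neutrality index = {ni:.3f}, P = {p_value:.4f}")
```

A neutrality index below 1 indicates an excess of nonsynonymous fixed differences, which is the pattern expected under positive selection.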
For
other tests and software to perform these statistics, see DNA Sequence Polymorphism, DnaSP. See also: Statistical
Tests of Neutrality (Lecture Note by P Beerli); Statistical Tests of Neutrality of Mutations against Excess
of Recent Mutations (Rare Alleles); Statistical Tests of Neutrality of Mutations against an
Excess of Old Mutations or a Reduction of Young Mutations;
Estimation
of theta; Innan & Tajima, 2002; Properties of
statistical tests of neutrality for DNA polymorphism data, Simonsen, 1995. Review of
statistical tests of selective neutrality on genomic data, Nielsen, 2001, Luikart, 2003, Harris & Meyer, 2006 and a Lecture Note by Gil McVean. See also a
review by SP Otto (Detecting
the Form of Selection from DNA Sequence Data. TIG 2000).
Linkage disequilibrium (LD)
The tendency for two 'alleles' to be present on the
same chromosome (positive LD), or not to segregate together (negative LD). As a
result, specific alleles at two different loci are found together more or less
often than expected by chance. LD is the nonindependence, at a population level, of
the alleles carried at different positions in the genome. In the absence of LD, the
expected frequency of a two-locus haplotype can be calculated as the
probability of the joint occurrence of two independent events, simply by
multiplying the two gene frequencies. The same situation may exist for more than
two alleles. The magnitude of LD is expressed as the delta (Δ, written D below) value and corresponds to
the difference between the observed and the expected haplotype frequency. If
there is no LD, D will be zero (or not significantly different from zero); if there is
positive LD it will be a positive value. It can also be negative if the two
alleles tend not to occur together. The statistical significance of LD, which
depends on the sample size, and the magnitude of LD are separate issues. The
statistical significance is usually determined by Fisher's exact test and the magnitude
is determined by either the D value or alternative measures. The magnitude can be normalized (for
allele frequencies) to have the same range of values for any frequency.
Ideally, the haplotype frequencies should be
calculated from family typing data. Obviously, this gives the most accurate
results. In practice, however, when family data are not available, D
and two-locus haplotype frequencies are calculated from a sample of the
population data by constructing 2x2 contingency tables for each allele pair. A
contingency table for this purpose contains the individual (observed) values
cross-classified by levels in two different attributes. A common 2x2 table
constructed in genetic studies is as follows:
|                       | allele i: Present (+) | allele i: Absent (-) | Row totals |
|-----------------------|-----------------------|----------------------|------------|
| allele j: Present (+) | a (+/+)               | b (+/-)              | a+b        |
| allele j: Absent (-)  | c (-/+)               | d (-/-)              | c+d        |
| Column totals         | a+c                   | b+d                  | N=a+b+c+d  |
Counts for each combination of levels (presence or
absence) of the two factors (alleles) are placed in each cell. The
corresponding $D_{ij}$ is estimated by the formula
(usually in HLA studies):
$$D_{ij} = \sqrt{\frac{d}{N}} - \sqrt{\frac{b+d}{N} \cdot \frac{c+d}{N}}$$
(originally described by Bodmer & Bodmer, 1970; see also Schipper,
1998)
The haplotype frequency ($HF_{ij}$) equals $GF_i \times GF_j + D_{ij}$, where GF is the gene frequency (the proportion of the chromosomes
carrying a particular allele). The haplotype frequency calculated with this
formula from the population data compares reasonably well with the estimates
obtained directly from counting haplotypes constructed from family segregation
data. This method generates a reliable estimate of a haplotype frequency with
the exception of very small haplotype frequencies. Also for other parts of the
genome, it has been reported that there is little or no advantage to
constructing haplotypes from family data rather than unrelated individuals. The
major point is that when using population data, genotyping errors become an
issue. When genotyping a large number of markers, an error rate of only 1% will
produce a large number of inaccurate haplotypes. Genotyping errors are not the
only possible sources of accuracy problems. Other factors include sample size,
allele frequency distributions and departures from HWE.
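The calculation from the 2x2 table can be sketched in Python as follows. The delta value follows the square-root formula given above. The gene frequencies are obtained here with the square-root estimator GF = 1 - sqrt(frequency of individuals lacking the allele), which assumes HWE and is an assumption of this sketch rather than something specified above. The function name and the counts are hypothetical.

```python
import math

def delta_from_2x2(a, b, c, d):
    """Gene frequencies, delta and haplotype frequency from a 2x2 presence/absence table.

    a: both alleles present, b: only allele j present,
    c: only allele i present, d: neither allele present.
    """
    n = a + b + c + d
    absent_i = (b + d) / n   # individuals without allele i
    absent_j = (c + d) / n   # individuals without allele j
    # Square-root gene frequency estimates (assume HWE).
    gf_i = 1 - math.sqrt(absent_i)
    gf_j = 1 - math.sqrt(absent_j)
    delta = math.sqrt(d / n) - math.sqrt(absent_i * absent_j)
    haplotype_freq = gf_i * gf_j + delta
    return gf_i, gf_j, delta, haplotype_freq

# Hypothetical counts for 10,000 individuals.
gf_i, gf_j, delta, hf = delta_from_2x2(a=2300, b=2800, c=1300, d=3600)
print(f"GF_i = {gf_i:.2f}, GF_j = {gf_j:.2f}, D = {delta:.3f}, HF = {hf:.3f}")
```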
There are other measures of LD. Because the value
of D
depends on allele frequencies, a normalization of D is needed. This is
achieved by taking into account the allele frequencies: the normalized delta value
is $D' = D_{AB} / D_{max}$. $D_{max}$ is the lesser of $p_A p_b$
and $p_a p_B$ if D is
positive, or the lesser of $p_A p_B$ and $p_a p_b$ if D is negative. Because the sign is arbitrary, $|D'|$
is often used rather than D'. Thus, D' (normalized LD) is
scaled to remove allele frequency effects. In a large enough sample, D' = 1
indicates complete LD and D' = 0 corresponds to no LD. |D'| is directly related
to the recombination fraction, and its generalization to more than two loci is the
only measure of LD not sensitive to allele frequencies. HAPLOVIEW and MIDAS are among the software that calculate D' values.
Another statistic
of linkage disequilibrium is the square of the correlation coefficient ($r^2$) between the alleles at
loci A and B: $r^2 = D^2 / (p_A p_a p_B p_b)$, which can also be expressed as $r^2 = D^2 / (p_A (1-p_A) p_B (1-p_B))$ (for two loci with two
alleles each). The
measure $r^2$
has several properties that make it more useful (Pritchard, 2001; Weiss, 2002; Carlson, 2004).
In brief, for low allele frequencies $r^2$ has more reliable sample
properties than |D'|. The allelic
association metric ρ (rho) has the strongest population theory basis, is least
sensitive to marker allele frequencies (Morton,
2001), and has several statistically optimal properties
such as consistency, asymptotic unbiasedness and asymptotic efficiency (Shete, 2003). This parameter can be
estimated with MIDAS
for SNP data (along with other measures of LD). (See
CIGMR; GENESTAT LD Measures; CGIL Summer Course Notes and Measures of LD by Hedrick,
1987; Lewontin, 1988; Devlin & Risch, 1995; Devlin, 1996; Morton,
2001; Wall & Pritchard, 2003; Shete, 2003; Jorde, 2003; Mueller, 2004; Zaykin, 2004; Wang, 2005; GOLD-Disequilibrium Statistics and a Lecture Note at UCL for further details.)
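The three pairwise measures discussed here (D, D' and the squared correlation) can be computed together once the haplotype frequency and the two allele frequencies are known. The sketch below assumes two biallelic loci with both loci polymorphic; the function name and the frequencies are hypothetical, and the sign of D' is kept rather than taking its absolute value.

```python
def ld_measures(p_AB, p_A, p_B):
    """Return (D, D', r^2) for two biallelic loci.

    p_AB: frequency of the haplotype carrying alleles A and B,
    p_A, p_B: frequencies of alleles A and B (both strictly between 0 and 1).
    """
    d = p_AB - p_A * p_B   # observed minus expected haplotype frequency
    if d >= 0:
        d_max = min(p_A * (1 - p_B), (1 - p_A) * p_B)
    else:
        d_max = min(p_A * p_B, (1 - p_A) * (1 - p_B))
    d_prime = d / d_max if d_max > 0 else 0.0
    r2 = d * d / (p_A * (1 - p_A) * p_B * (1 - p_B))
    return d, d_prime, r2

d, d_prime, r2 = ld_measures(p_AB=0.10, p_A=0.2, p_B=0.3)
print(f"D = {d:.3f}, D' = {d_prime:.3f}, r2 = {r2:.3f}")
```

The example shows why the two normalized measures should not be used interchangeably: the same haplotype data give a moderate D' but a small squared correlation when the allele frequencies at the two loci differ.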
Statistical Methods for LD Estimation: While the
Mattiuz formula can be used to calculate two-locus haplotype frequencies manually, an
alternative method, maximum likelihood estimation (achieved by the EM
'expectation-maximization' algorithm), can be used if computing facilities are
available. This method was originally described by Yasuda and Tsuji (1975), and compared with other methods by
Schipper et al (1998)
and Excoffier et al. (1995). It can also be used for
multiple-locus haplotypes (Long, 1995). ARLEQUIN, one of the most sophisticated
population genetic data analysis packages,
as well as EMLD use
the EM algorithm to calculate multilocus LD. Other software to perform
LD analysis includes MIDAS, HAPLOVIEW, Genetic Data Analysis (GDA), EH,
MLD, DISEQ, SHEsis,
GOLD and PopGene. The proprietary LD software
HelixTree
(manual) computes multilocus haplotype
probabilities using the composite haplotype method (CHM) and the Expectation
Maximization (EM) algorithm for SNP data. LD analysis
can also be performed online (Online LD Analysis; Genotype2LDBlock; VG2). All of the above (pairwise) LD measures can be estimated
from unphased SNP data in Stata using David Clayton's program pwld within genassoc.pkg.
For HLA data, HWE and LD (as well as the Ewens-Watterson test) can be assessed
using PyPop.
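The idea behind EM estimation of haplotype frequencies can be shown for the simplest case of two biallelic loci typed in unrelated individuals. Only double heterozygotes are ambiguous in phase; the expectation step splits them between the two possible haplotype pairs in proportion to the current haplotype frequencies (which assumes HWE), and the maximization step recounts the haplotypes. This is a minimal sketch under those assumptions, with a hypothetical function name and input layout; it is not the implementation used by ARLEQUIN or any other package named above.

```python
def em_haplotype_frequencies(counts, iterations=100):
    """EM estimates of two-locus haplotype frequencies from unphased genotypes.

    counts[i][j] = number of individuals carrying i copies of allele A
    (locus 1) and j copies of allele B (locus 2), with i, j in {0, 1, 2}.
    """
    n_double_het = counts[1][1]
    total = 2 * sum(sum(row) for row in counts)   # number of chromosomes
    # Haplotypes that can be counted directly from phase-known genotypes.
    base = {
        "AB": 2 * counts[2][2] + counts[2][1] + counts[1][2],
        "Ab": 2 * counts[2][0] + counts[2][1] + counts[1][0],
        "aB": 2 * counts[0][2] + counts[1][2] + counts[0][1],
        "ab": 2 * counts[0][0] + counts[1][0] + counts[0][1],
    }
    freqs = {h: 0.25 for h in base}   # arbitrary starting values
    for _ in range(iterations):
        # E-step: probability that a double heterozygote is AB/ab rather than Ab/aB.
        cis = freqs["AB"] * freqs["ab"]
        trans = freqs["Ab"] * freqs["aB"]
        x = cis / (cis + trans) if cis + trans > 0 else 0.5
        # M-step: recount haplotypes with the double heterozygotes split accordingly.
        freqs = {
            "AB": (base["AB"] + x * n_double_het) / total,
            "ab": (base["ab"] + x * n_double_het) / total,
            "Ab": (base["Ab"] + (1 - x) * n_double_het) / total,
            "aB": (base["aB"] + (1 - x) * n_double_het) / total,
        }
    return freqs

# Hypothetical genotype counts, rows = copies of A (0, 1, 2), columns = copies of B.
example = [[25, 0, 0],
           [0, 50, 0],
           [0, 0, 25]]
print(em_haplotype_frequencies(example))
```

In this example the phase-known homozygotes carry only AB and ab haplotypes, so the double heterozygotes are progressively assigned to the AB/ab configuration.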
Interpretation of LD Data: The patterns of LD
observed in natural populations are the result of a complex interplay between
genetic factors and the population's demographic history (Pritchard, 2001). LD is usually a
function of distance between the two loci. This is mainly because recombination
acts to break down LD in successive generations (Hill, 1966). When a mutation first
occurs it is in complete LD with the nearest marker (D' = 1.0). Given enough
time and as a function of the distance between the mutation and the marker, LD
tends to decay, and at complete equilibrium D' reaches 0. Thus, it
decreases at every generation of random mating unless some process is opposing
the approach to linkage 'equilibrium'. However, physical distance could
account for less than 50% of the observed variation in LD. One genetic
phenomenon that affects LD is gene conversion. Gene conversion is an important
mechanism in the breaking down of allelic associations over short distances,
i.e., decay of LD. Other factors that influence LD include changes in
population demographics (such as population growth, bottlenecks, geographical
subdivision, admixture and migration) and selective forces. Admixture
(intermixture of populations) would cause LD if the mixing populations have
different allele frequencies. LD will also be erased faster in large
populations than in small ones (chance effects in small populations maintain LD).
Permanent LD may result from natural selection if some gametic combinations
confer higher fitness than other combinations. An extraordinary example of the
effect of recombination rates on LD is the discrepancy between genetic distance
and physical distance between HFE and HLA-A, which generates strong LD despite
5Mb distance (Malfroy,
1997). Regional LD may also be variable according to
haplotype. An example has been presented for HLA haplotypes. Haplotype-specific
patterns of LD (Ahmad, 2003) may reflect
haplotype-specific recombination hotspots as has been shown for mouse MHC.
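The decay of LD by recombination can be illustrated with the standard textbook relationship in which D is reduced by a factor of (1 - c) in each generation of random mating, where c is the recombination fraction between the two loci. This formula is not given in the text above, and the sketch ignores drift, selection, admixture and gene conversion. The function names are hypothetical.

```python
import math

def ld_after_generations(d0, recombination_fraction, generations):
    """D after t generations of random mating: D_t = D_0 * (1 - c)^t."""
    return d0 * (1 - recombination_fraction) ** generations

def ld_half_life(recombination_fraction):
    """Number of generations needed for D to fall to half of its initial value."""
    return math.log(0.5) / math.log(1 - recombination_fraction)

# Unlinked loci (c = 0.5) lose half of their LD every generation,
# whereas tightly linked loci (c = 0.01) retain it much longer.
print(ld_after_generations(0.2, 0.5, 1))
print(round(ld_half_life(0.01), 1))
```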
Note that LD has nothing to do with HWE and
should not be confused with it (see Possible
Misunderstandings in Genetics).
Genetic distance (GD)
Genetic
distance is a measurement of genetic relatedness of samples of populations
(whereas genetic diversity represents diversity within a population). The
estimate is based on the number of allelic substitutions per locus that have
occurred during the separate evolution of two populations. (See lecture notes
on Genetic Distances, Estimating
Genetic Distance; and GeneDist: Online
Calculator of Genetic Distance.) The software Arlequin v3.01, PHYLIP, GDA, PopGene, Populations and SGS are suitable
for calculating population-to-population genetic distances from allele frequencies
(Microsat is a
microsatellite distance program).
Genetic distance can be computed with the freeware
PHYLIP. Most components of PHYLIP are available on
the web (or on Pasteur Webserver). One component of
the package, GENDIST, estimates genetic distance from
allele frequencies using one of three methods: Nei's, Cavalli-Sforza's or
Reynolds' (see papers by Cavalli-Sforza & Edwards, 1967, Nei, 1983, Nei, 1996 and lecture notes (1) and (2) for more information on these methods).
GENDIST can be run online using default options (Nei's genetic distance) to obtain genetic
distance matrix data. The PHYLIP program CONTML estimates phylogenies from gene frequency data by
maximum likelihood under a model in which all divergence is due to genetic
drift in the absence of new mutations (Cavalli-Sforza's method)
and draws a tree. The program is also available on the web and runs with default options. If
new mutations are contributing to allele frequency changes, Nei's method should
be selected on GENDIST to estimate genetic distances first. Then a tree can be
obtained using one of the following components of PHYLIP. NEIGHBOR draws a phylogenetic tree from
the genetic distance matrix data (from GENDIST). It uses either Saitou and Nei's (1987) "Neighbor Joining (NJ) Method" or the UPGMA (unweighted pair group method with
arithmetic mean; average linkage clustering) method (Sneath & Sokal, 1973). Neighbor Joining is a distance matrix method
producing an unrooted tree without the assumption of a clock (the evolutionary
rate does not have to be the same in all lineages). The major assumption of UPGMA
is an equal rate of evolution along all branches (which is
frequently unrealistic). NEIGHBOR can be run online. Other components of PHYLIP that draw
phylogenetic trees from genetic distance matrix data are FITCH / online (Fitch-Margoliash method
with no assumption of equal evolutionary rate) and KITSCH / online (employs Fitch-Margoliash and Least
Squares methods with the assumption that all tip species are contemporaneous,
and that there is an evolutionary clock; in effect, a molecular clock) (Mathematical Formulae of Various Genetic Distance Measures;
Genetic Distance Equations). Another freeware program, PopGene, calculates
Nei's genetic distance and creates a tree using the UPGMA method from genotypes.
For genetic distance calculation in Excel, try the freeware GenAlEx
by Peakall & Smouse.
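To make the UPGMA step concrete, here is a minimal pure-Python sketch of average linkage clustering on a distance matrix. It is not the code used by NEIGHBOR or PopGene; the function name, the nested-tuple tree representation and the example matrix are assumptions made for illustration. At each step the two closest clusters are merged, the node is placed at half their distance (the equal-rate assumption), and distances to the new cluster are size-weighted averages.

```python
def upgma(names, dist):
    """UPGMA on a symmetric distance matrix (list of lists).

    Returns (tree, merges): tree is a nested tuple of names, and merges lists
    (left_subtree, right_subtree, node_height) in the order of joining.
    """
    n = len(names)
    clusters = {i: (names[i], 1) for i in range(n)}          # id -> (subtree, size)
    d = {(i, j): dist[i][j] for i in range(n) for j in range(i + 1, n)}
    next_id = n
    merges = []
    while len(clusters) > 1:
        (a, b), d_ab = min(d.items(), key=lambda item: item[1])   # closest pair
        tree_a, size_a = clusters.pop(a)
        tree_b, size_b = clusters.pop(b)
        # Distance from the merged cluster to each remaining cluster is the
        # average over all member pairs, i.e. a size-weighted mean.
        updated = {}
        for c in clusters:
            d_ac = d[(min(a, c), max(a, c))]
            d_bc = d[(min(b, c), max(b, c))]
            updated[c] = (size_a * d_ac + size_b * d_bc) / (size_a + size_b)
        d = {pair: v for pair, v in d.items() if a not in pair and b not in pair}
        for c, v in updated.items():
            d[(c, next_id)] = v                               # c < next_id always
        clusters[next_id] = ((tree_a, tree_b), size_a + size_b)
        merges.append((tree_a, tree_b, d_ab / 2.0))           # equal-rate assumption
        next_id += 1
    (tree, _size), = clusters.values()
    return tree, merges


# Hypothetical distance matrix for four populations.
pop_names = ["A", "B", "C", "D"]
matrix = [
    [0, 2, 6, 10],
    [2, 0, 6, 10],
    [6, 6, 0, 10],
    [10, 10, 10, 0],
]
tree, merges = upgma(pop_names, matrix)
print(tree)
print(merges)
```

Because every node is placed at half the distance between the clusters it joins, all tips end up equidistant from the root.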
Because of the different assumptions they are based on, the NJ and UPGMA
methods may construct dendrograms with totally different topologies. For an
example of this and a review of the main differences between the two methods, see Nei & Roychoudhury, 1993 (PDF). Both methods use distance
matrices (the Fitch-Margoliash and Minimum Evolution methods are also distance
methods). The principal difference between NJ and UPGMA is that NJ does not
assume an equal evolutionary rate for each lineage. Since the assumption of a constant rate of
evolution does not hold for human populations, NJ seems to be the better
method. For the genetic loci subject to natural selection, the evolutionary
rate is not the same for each population and therefore UPGMA should be avoided
for the analysis of such loci (including the HLA genes). The leading group in
HLA-based genetic distance analysis, led by Arnaiz-Villena, proposes that the most
appropriate genetic distance measure for the HLA system is the DA value first
described by Nei, 1983. Unlike UPGMA, NJ produces an unrooted tree. To
find the root of the tree, one can add an outgroup. The point in the tree where
the edge to the outgroup joins is the best possible estimate for the root
position. One persistent problem with tree construction is the lack of statistical
assessment of the phylogenetic tree presented. Such assessment is best done with the widely
available bootstrap analysis, originally described by Felsenstein (Evolution
1985;39:783-791; available through JSTOR if you have access) and Efron, 1996, and reviewed by Nei, 1996. For a discussion of
statistical tests of molecular phylogenies, see Li & Gouy, 1990 and Nei, 1996. For the topology to be
statistically significant, the bootstrap value for each cluster should reach at
least 70%; a threshold of 50% overestimates the accuracy of the tree. Bootstrap tests
should be done with at least 1000 (preferably more) replications.
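The sketch below combines the two ideas from this paragraph: Nei's DA distance, DA = 1 - (1/r) Σ_loci Σ_alleles sqrt(x y) for r loci, and a bootstrap in which loci are resampled with replacement. It is deliberately simplified. A full bootstrap test rebuilds the tree for every replicate and counts how often each cluster recurs, whereas this example only counts how often a chosen pair of populations is the closest pair. The function names, data layout and allele frequencies are hypothetical.

```python
import math
import random


def nei_da(pop_x, pop_y):
    """Nei's DA distance; each population is a list of per-locus frequency lists."""
    total = 0.0
    for fx, fy in zip(pop_x, pop_y):
        total += sum(math.sqrt(a * b) for a, b in zip(fx, fy))
    return 1.0 - total / len(pop_x)


def bootstrap_pair_support(pops, pair, replicates=1000, seed=0):
    """Fraction of bootstrap replicates in which `pair` is the closest pair.

    pops maps population name -> list of loci. Loci are resampled with
    replacement, the same resampled loci being used for every population.
    """
    rng = random.Random(seed)
    names = sorted(pops)
    n_loci = len(pops[names[0]])
    target = tuple(sorted(pair))
    all_pairs = [(a, b) for i, a in enumerate(names) for b in names[i + 1:]]
    hits = 0
    for _ in range(replicates):
        loci = [rng.randrange(n_loci) for _ in range(n_loci)]
        resampled = {name: [pops[name][i] for i in loci] for name in names}
        closest = min(all_pairs,
                      key=lambda ab: nei_da(resampled[ab[0]], resampled[ab[1]]))
        if closest == target:
            hits += 1
    return hits / replicates


# Hypothetical allele frequencies for three populations at three loci.
pops = {
    "X": [[0.6, 0.4], [0.7, 0.3], [0.2, 0.8]],
    "Y": [[0.5, 0.5], [0.8, 0.2], [0.3, 0.7]],
    "Z": [[0.1, 0.9], [0.2, 0.8], [0.9, 0.1]],
}
print(bootstrap_pair_support(pops, ("X", "Y"), replicates=1000))
```

With only three loci the resampling is far too coarse to be meaningful; the example is meant to show the mechanics, not a realistic analysis.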
Nei noted that some genes are more suitable than
others in phylogenetic inference and that most tree-building methods tend to
produce the same topology whether the topology is correct or not (Nei M, 1996). He also added that sometimes adding one
more species/population would change the whole tree for unknown reasons. An
example of this has been provided in a study of human populations with genetic
distances (Nei & Roychoudhury, 1993). The properties of most
popular genetic distance measures have been reviewed (Kalinowski, 2002). Whichever is used,
large sample sizes are required when populations are relatively genetically
similar, and loci with more alleles produce better estimates of genetic
distance. However, in a simulation study, Nei et al concluded that more than 30
loci should be used for making phylogenetic trees (Nei, 1983). There seems to be a consensus that estimated
trees are nearly always erroneous (i.e., the topological arrangement will be wrong)
if the number of loci is fewer than 30 (Nei, 1996;
Jorde LB. Human genetic distance studies. Ann Rev Anthropol
1985;14:343-73; available through JSTOR if you have access). If populations are closely
related even 100 loci may be necessary for an accurate estimation of the
relationships by genetic distance methods. Cavalli-Sforza et al have noted
important correlations between the genetic trees and linguistic evolutionary
trees, with the exceptions of New Guinea, Australia and South America (Cavalli-Sforza, 1994).
Especially for the HLA genes,
phylogenetic trees can be constructed by using Nei's DA genetic distance
values and the NJ method with bootstrap tests on DISPAN. Correspondence analysis, a supplementary analysis
to genetic distances and dendrograms, displays a global view of the relationships
among populations (Greenacre, 1984; Greenacre & Blasius, 1994; Blasius & Greenacre, 1998). This
type of analysis tends to give results similar to those of dendrograms as
expected from theory (Cavalli-Sforza & Piazza, 1975),
and is more informative and accurate than dendrograms especially when there is
considerable genetic exchange between close geographic neighbors (Cavalli-Sforza, 1994). In their
enormous effort to work out the genetic relationships among human populations,
Cavalli-Sforza et al concluded that
two-dimensional scatter plots obtained by correspondence analysis frequently
resemble geographic maps of the populations with some distortions (Cavalli-Sforza, 1994). Using the same
allele frequencies that are used in phylogenetic tree construction, correspondence analysis
can be performed on ViSta (v7.0), VST or SAS, but most
conveniently on the Multi Variate
Statistical Package (MVSP). Link to a
Tutorial on Correspondence Analysis.
Spreadsheet
Exercises in Ecology & Evolution
&
Spreadsheet
Exercises in Conservation Biology & Landscape Ecology
(including HWE
and LD)
History of Population Genetics and Evolution in A History of Genetics by AH Sturtevant
Introduction
to Population Genetics @ NBII
ASHI 2001 Biostatistics and Population Genetics Workshop
Notes
Microsatellites and Genetic Distance (Primer on Genetic
Distance)
HWE in Kimball's Biology Pages Online
Biology Book: Genetics - HWE
Online HWE Test OEGE HWE Calculator HWE and
Association Testing for SNPs in Case-Control Studies
Online GD Calculation Genetic Power Calculator
Simulations: Population Biology Population Genetics Evolutionary Biology
Human Genetics for the Social Sciences Interactive Learning
Exercises
Lectures on Population Genetics (1) & (2) & (3)
& (4)
& (5) & (6) Population Genetics Glossary
Population Genetics Course by Knud Christensen (PDF of whole course)
Population
Genetics Course by Kent E Holsinger
Evolutionary Quantitative Genetics Notes by Bruce
Walsh
Human Genome Epidemiology Online Book Medical
Applications of Population Genetics
Genetic Epidemiology Notes Genetic Epidemiology Glossary
Software for Population Genetic Analyses
Freeware Population Genetic Data Analysis
Software (List of Features):
Arlequin v3.1 (2005) PopGene
GDA Genetix GenePop GeneStrut SGS TFPGA MVSP
EMLD LDA MIDAS PyPop (HLA & HWE/LD)
Gene[VA] Tools for Genetic Data Analysis
(AGP,
Universite de Geneve, Switzerland):
HLA
Data Analysis (HWE, LD, Association) (EFI2006 Teaching Session)
GOLD-Disequilibrium Statistics
Extended
Haplotype Homozygosity (EHH) Web-Tool (ref) Haploplotter (ref)
HAPLOTYPER Genotype2LDBlock
DISEQ PHASE v2.0
UNPHASED HAPLOVIEW
(Tutorial)
HWE
PHYLIP (BioPortal) PHYLIP (Pasteur)
PHYLIP 3.62 (Download) DISPAN ViSta GenAlEx CLUMP TDT
Genetic Calculation Applets by Knud Christensen
Software at Dyer
Laboratory for Population Genetics
MSA (for microsatellite data) POPULATIONS
WINPOP v2.0
QUANTO
Computational Genetics Software including EHAP
GSF: Genetic Software Forum Partition for Online Bayesian Analysis
Comprehensive
List of Genetic Analysis Software (1) (2) (3)
Review of common
Population Genetics software by Labate JA, 2000
Topali v2 for multiple sequence alignments by BioSS
Population Genetics Software Course-ECL290 at UCDAVIS Genomic
Variation Laboratory
Linkage
Disequilibrium Analysis Bibliography
Address for bookmark: http://www.dorak.info/genetics/popgen.html
M.Tevfik Dorak, MD, PhD
Last updated 14 September 2009