Basic Population Genetics [M.Tevfik Dorak]

BASIC POPULATION GENETICS

Mehmet Tevfik DORAK, MD, PhD

G.H. Hardy (the English mathematician) and W. Weinberg (the German physician) independently worked out the mathematical basis of population genetics in 1908 (Hardy, 1908). Their formula predicts the expected genotype frequencies from the allele frequencies in a diploid Mendelian population. They were concerned with questions like "what happens to the frequencies of alleles in a population over time?" and "would you expect to see alleles disappear or become more frequent over time?" Hardy and Weinberg showed that if the population is very large and mating is random, allele frequencies remain unchanged (in equilibrium) over time unless some other factor intervenes. The argument runs as follows. If the frequencies of alleles A and a (of a biallelic locus) are p and q, then p + q = 1. This means (p + q)² = 1 too, and expanding gives (p + q)² = p² + 2pq + q² = 1. In this formula, p² corresponds to the frequency of the homozygous genotype AA, q² to that of aa, and 2pq to that of the heterozygous genotype Aa. Since AA, Aa and aa are the three possible genotypes for a biallelic locus (like most SNPs), the sum of their frequencies should be 1 (see Hardy-Weinberg parabola (de Finetti diagram)). In summary, the Hardy-Weinberg formula shows that:

p²   +  2pq  +  q²   =  1
(AA)    (Aa)    (aa)

If the observed frequencies do not differ significantly from these expected frequencies, the population is said to be in Hardy-Weinberg equilibrium (HWE). If they do, at least one of the assumptions of the formula (listed below) is violated, and the population is not in HWE. Of course, the three observed genotype frequencies will always sum to 1.0, but to check HWE, one starts with p calculated from the observed homozygote genotype frequency, calculates q as (1 - p), and estimates the expected heterozygote frequency (2pq) and the homozygote frequency for the other allele (q²). If these frequencies are not similar to the observed values, the sample may not be in HWE. The comparison of observed and expected values is made by the goodness-of-fit test. A statistically significant result suggests departure from HWE (or violation of HWE). Such a result may be due to biological reasons or misclassification of genotypes (genotyping error) as discussed below. For formal testing, try one of the online tools (Online HWE Analysis (OEGE); HWE and Association Testing for SNPs in Case-Control Studies; Haploview (SNP)) or freeware (Arlequin; PopGene; GDA; TFPGA; PyPop (HLA & HWE/LD)). One important point is to choose an exact test for multiallelic markers because Chi-squared tests are inappropriate when there are multiple alleles (Guo & Thompson, 1992). It has been argued that even for large samples, the Chi-squared test is inappropriate and exact tests should be used in the assessment of HWE (Wigginton, 2005; PDF). In any case, the HWE test is a low-power statistical test for identifying genotyping errors (Leal, 2005).
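The goodness-of-fit comparison for a biallelic locus can be sketched in a few lines of Python. This is a minimal illustration, not a replacement for the exact tests recommended above. The function name `hwe_chi_square` and the example counts are hypothetical. Note also that the sketch estimates p by counting alleles in all three genotypes, which is an assumption that differs slightly from starting with the homozygote frequency as described in the text.

```python
import math

def hwe_chi_square(n_AA, n_Aa, n_aa):
    """Chi-squared goodness-of-fit test for HWE at a biallelic locus.

    Returns (p, chi2, p_value). Assumes allele counting for p and
    1 degree of freedom (3 genotype classes - 1 - 1 estimated parameter).
    """
    n = n_AA + n_Aa + n_aa
    p = (2 * n_AA + n_Aa) / (2 * n)   # frequency of allele A
    q = 1 - p
    if p in (0.0, 1.0):
        raise ValueError("locus is monomorphic; HWE cannot be tested")
    observed = (n_AA, n_Aa, n_aa)
    expected = (p * p * n, 2 * p * q * n, q * q * n)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of the chi-squared distribution with 1 df
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return p, chi2, p_value

# Hypothetical genotype counts (AA, Aa, aa)
p, chi2, p_value = hwe_chi_square(298, 489, 213)
print(f"p = {p:.4f}, chi2 = {chi2:.3f}, P = {p_value:.3f}")
```

A P value above the chosen threshold (conventionally 0.05) means only that no departure was detected; as noted above, the test has low power.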

Assumptions of HWE

1. Population size is effectively infinite (in small populations, departure from expected genotype frequencies occurs more easily)

2. Mating is random in the population (the most common deviation results from inbreeding)

3. Males and females have similar allele frequencies, and the locus is autosomal

4. There are no mutations and migrations affecting the allele frequencies in the population

5. The genotypes have equal fitness, i.e., there is no selection (in viability and fitness)

The Hardy-Weinberg law suggests that as long as the assumptions are valid, allele and genotype frequencies will not change in a population in successive generations. Thus, any deviation from HWE may indicate the following biological processes:

1. Small population size results in random sampling errors and unpredictable genotype frequencies (a real population's size is always finite and the frequency of an allele may fluctuate from generation to generation due to chance events),

2. Assortative mating which may be positive (increases homozygosity; self-fertilization is an extreme example) or negative (increases heterozygosity), or inbreeding which increases homozygosity in the whole genome without changing the allele frequencies. Rare-male mating advantage also tends to increase the frequency of the rare allele and heterozygosity for it (in reality, random mating does not occur all the time). Cryptic population stratification (population substructure) is another reason for departure from HWE.

3. A very high mutation rate in the population (typical mutation rates are < 10^-4 per generation) or massive migration from a genotypically different population interfering with the allele frequencies,

4. Selection of one or a combination of genotypes (selection may be negative or positive). Selective elimination of homozygotes as in some autosomal dominant diseases, where homozygotes for the mutation may die in utero, is an example. In a very large sample, this could violate HWE (see Ineichen & Batschelet, 1975 for the effect of natural selection on HWE). Similar to this selection, sampling error (selection bias) may also affect HWE if the bias concerned ethnicity.

5. Unequal transmission ratio (transmission ratio distortion or segregation distortion) of alternative alleles from parents to offspring (as in mouse t-haplotypes),

6. Differential gene frequency among males and females.

In most population genetic estimations (like linkage disequilibrium calculations), HWE is assumed. This means that genotype probabilities are determined by allele frequencies alone, i.e., there is no transmission ratio distortion, no selection against a genotype (lethality), etc. If this assumption is not met, the estimations will not be accurate: statistical methods using allele frequencies may not be valid, and methods that use genotype frequencies should be preferred (Xu, 2002) (discussed below). See HWE simulation in EvoTutor for the effects of selection, mutation, migration, genetic drift and assortative mating.

Implications of HWE

1. The allele frequencies remain constant from generation to generation. This means that the hereditary mechanism itself does not change allele frequencies. It is possible for one or more assumptions of the equilibrium to be violated and still not produce deviations from the expected frequencies that are large enough to be detected by the goodness-of-fit test,

2. When an allele is rare, there are many more heterozygotes than homozygotes for it. Thus, rare alleles are practically impossible to eliminate even if there is selection against their homozygous genotype (as in recessive disorders),

3. For populations in HWE, the proportion of heterozygotes is maximal when allele frequencies are equal (p = q = 0.50), and when this happens, the heterozygote frequency will be 0.50 (2 × 0.50 × 0.50). Unless HWE is violated (as in selective loss of homozygotes), heterozygosity can never be more than 0.50 at any biallelic locus (see the de Finetti diagram for an illustration of this),

4. An application of HWE is that when the frequency of an autosomal recessive disease (e.g., sickle cell disease, hereditary hemochromatosis, congenital adrenal hyperplasia) is known in a population, and unless there is a reason to believe HWE does not hold in that population, the gene frequency of the disease gene can be calculated. See also Clinical Genetics. Likewise, the carrier rate may be calculated for autosomal recessive disorders if the disease gene frequency is known. For example, phenylketonuria (PKU) occurs in 1/11,000 (q²), which gives a heterozygote carrier frequency of approximately 1/50 [2 × (1 - q) × q]. If diseased individuals (q²) are deducted from the whole population, the carrier rate among unaffected individuals is 2pq / (1 - q²), which simplifies to 2q / (1 + q).
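The PKU arithmetic in the last item can be checked with a short Python sketch. The function name `carrier_rates` is hypothetical, and the calculation assumes HWE holds for the disease locus.

```python
import math

def carrier_rates(disease_incidence):
    """Carrier frequencies for an autosomal recessive disease under HWE.

    disease_incidence is the frequency of affected homozygotes (q squared).
    Returns (q, carrier rate in the whole population,
             carrier rate among unaffected individuals).
    """
    q = math.sqrt(disease_incidence)      # disease allele frequency
    p = 1 - q
    carriers_all = 2 * p * q              # heterozygotes in the whole population
    carriers_unaffected = 2 * q / (1 + q) # same as 2pq / (1 - q**2)
    return q, carriers_all, carriers_unaffected

q, het, het_unaffected = carrier_rates(1 / 11000)
print(f"q = {q:.5f}")
print(f"carrier rate = {het:.5f} (about 1 in {1 / het:.0f})")
print(f"carrier rate among unaffected = {het_unaffected:.5f}")
```

For a rare disease the two carrier rates are almost identical, because removing the small q² fraction of affected individuals barely changes the denominator.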

It has to be remembered that when HWE is tested, mathematical thinking is necessary. When the population is found in equilibrium, it does not necessarily mean that all assumptions are valid since there may be counterbalancing forces. Similarly, a significant violation may be due to sampling errors (including Wahlund effect, see below and Glossary), misclassification of genotypes, measuring two or more systems as a single system, population substructure, failure to detect rare alleles and the inclusion of non-existing alleles. The Hardy-Weinberg laws rarely hold true in nature (otherwise evolution would not occur). Organisms are subject to mutations and selective forces, and they move around, or the allele frequencies may be different in males and females. The gene frequencies are constantly changing in a population, but the effects of these processes can be assessed by using the Hardy-Weinberg law as the starting point.

The direction of departure of observed from expected frequency cannot be used to infer the type of selection acting on the locus even if it is known that selection is acting. If selection is operating, the frequency of each genotype in the next generation will be determined by its relative fitness (W). Relative fitness is a measure of the relative contribution that a genotype makes to the next generation. It can be measured in terms of the intensity of selection (s), where W = 1 - s [0 ≤ s ≤ 1]. The frequencies of each genotype after selection will be p² W_AA, 2pq W_Aa, and q² W_aa. The highest fitness is always 1 and the others are estimated proportional to this. For example, in the case of heterozygote advantage (or overdominance), the fitness of the heterozygous genotype (Aa) is 1, and the fitnesses of the homozygous genotypes negatively selected are W_AA = 1 - s_AA and W_aa = 1 - s_aa. It can be shown mathematically that only in this case a stable polymorphism is possible despite selection. Other selection forms, underdominance and directional selection, result in unstable polymorphisms. The weighted average of the fitnesses of all genotypes is the mean fitness. It is important that genetic fitness is determined by both fertility and viability. This means that diseases that are fatal to the bearer but do not reduce the number of progeny are not genetic lethals and do not have reduced fitness (like the adult onset genetic diseases: Huntington chorea, hereditary hemochromatosis). The detection of selection is not easy because the impact on changes in allele frequency occurs very slowly and selective forces are not static (may vary even in the same generation as in antagonistic pleiotropy).
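The effect of relative fitness on allele frequency can be illustrated with a small deterministic simulation. This is a sketch under simplifying assumptions (infinite population, random mating, constant fitnesses, no mutation or migration); the function names and the selection coefficients are hypothetical. With heterozygote advantage, the allele frequency is expected to settle at p = s_aa / (s_AA + s_aa) whatever the starting point, whereas with underdominance the interior equilibrium is unstable and one allele is lost.

```python
def next_allele_freq(p, w_AA, w_Aa, w_aa):
    """Frequency of allele A after one generation of selection."""
    q = 1 - p
    # Mean fitness: weighted average of genotype fitnesses
    mean_w = p * p * w_AA + 2 * p * q * w_Aa + q * q * w_aa
    # A alleles come from all AA survivors and half of the Aa survivors
    return (p * p * w_AA + p * q * w_Aa) / mean_w

def iterate_selection(p0, w_AA, w_Aa, w_aa, generations):
    """Apply selection repeatedly and return the final frequency of A."""
    p = p0
    for _ in range(generations):
        p = next_allele_freq(p, w_AA, w_Aa, w_aa)
    return p

# Heterozygote advantage: s_AA = 0.1, s_aa = 0.3, so W = 0.9, 1.0, 0.7
for start in (0.01, 0.99):
    print(start, round(iterate_selection(start, 0.9, 1.0, 0.7, 1000), 4))

# Underdominance: heterozygote has the lowest fitness
print(round(iterate_selection(0.51, 1.0, 0.8, 1.0, 1000), 4))
```

In the first case both starting frequencies converge on 0.3 / (0.1 + 0.3) = 0.75; in the second, a start just above 0.5 drifts away from the equilibrium towards fixation of A.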

All discussions presented so far concern a simple biallelic locus like most SNPs. In real life, however, there are many loci which are multiallelic, and interacting with each other as well as with environmental factors. The Hardy-Weinberg principle is equally applicable to multiallelic loci but the mathematics is slightly more complicated. For multigenic and multifactorial traits, which are mathematically continuous as opposed to discrete, more complex techniques of quantitative genetics are required.

In a final note on the practical use of HWE, it has to be emphasized that its violation in daily practice is most frequently due to genotyping errors (Gomes, 1999; Lewis, 2002; Sen & Burmeister, 2008). Allelic misassignments, as frequently happen when the PCR-SSP method is used, sometimes due to allelic dropout (increased homozygosity), are the most frequent causes of Hardy-Weinberg disequilibrium. When this is observed, the genotyping protocol should be reviewed. Another common reason is the presence of an unknown allele which is not considered in the genotyping scheme (null allele). This happens when a variant is considered to be biallelic while it is actually multiallelic. In a case-control association study, it is of paramount importance that the control group is in HWE to rule out any technical errors and to avoid false-positive associations (Tiret, 1995; Schaid & Jacobsen, 1999). The violation of HWE in the case group, however, may be due to a real association (Nielsen, 1998; Lee, 2003; Czika & Weir, 2004). When Hardy-Weinberg disequilibrium is not due to a technical error, the statistical evaluation of the data should involve statistical tests using genotype frequencies rather than allele frequencies (Xu, 2002; Lewis, 2002). One such test is the Cochran-Armitage trend test (to estimate the common (per-allele) odds ratio for the additive model), which can be run online for SNP data (Online HWE & Association Testing). Departure from HWE may be a substantial source of error in Expectation-Maximization haplotype frequency estimation, simply because the algorithm relies on HWE in its 'expectation' step (Fallin & Schork, 2000). For reviews of HWE in genetic epidemiology studies, see Gomes, 1999; Attia, 2003; Leal, 2005; Zou & Donner, 2006. See Testing for Consistency with HW Proportions at PEDSTATS: HWE Tutorial; HWE testing in Stata.

See also the following videos explaining HWE: HWE videos (by Education Portal) & HWE (by Bozeman Science).

Some concepts relevant to HWE

Wahlund effect: Reduction in observed heterozygosity (increased homozygosity) because of pooling discrete subpopulations with different allele frequencies that do not interbreed as a single randomly mating unit. When all subpopulations have the same gene frequencies, no variance among subpopulations exists, and no Wahlund effect occurs (F_ST= 0). Isolate breaking is the phenomenon that the average heterozygosity temporarily increases when discrete subpopulations make contact and interbreed (this is due to decrease in homozygotes). It is the opposite of Wahlund effect (Wahlund Effect & F-Statistics Simulation Applet).

F statistics: The F statistics in population genetics have nothing to do with the F statistic used to evaluate differences in variances. Here F stands for fixation index, fixation being increased homozygosity resulting from inbreeding. Population subdivision results in the loss of genetic variation (measured by heterozygosity) within subpopulations because they are small populations and genetic drift acts within each one of them. This means that population subdivision would result in decreased heterozygosity relative to the heterozygosity expected under random mating if the whole population were a single breeding unit. Wright developed three fixation indices to evaluate population subdivision: F_IS (interindividual), F_ST (subpopulations), F_IT (total population).

F_IS is a measure of the deviation of genotypic frequencies from panmictic (large, randomly mating population with no substructure) frequencies in terms of heterozygous deficiency or excess. It is what is known as the inbreeding coefficient (f), which is conventionally defined as the probability that two alleles in an individual are identical by descent (autozygous). The technical description is the correlation of uniting gametes relative to gametes drawn at random from within a subpopulation (Individual within the Subpopulation) averaged over subpopulations. It is calculated in a single population as F_IS = 1 - (H_OBS / H_EXP) [equal to (H_EXP - H_OBS) / H_EXP], where H_OBS is the observed heterozygosity and H_EXP is the expected heterozygosity calculated on the assumption of random mating. It shows the degree to which heterozygosity is reduced below the expectation. The value of F_IS ranges between -1 and +1. Negative F_IS values indicate heterozygote excess (outbreeding) and positive values indicate heterozygote deficiency (inbreeding) compared with HWE expectations.
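As a minimal sketch, F_IS for one biallelic locus in a single population can be computed from genotype counts as below. The function name `f_is` and the counts are hypothetical, and the expected heterozygosity is taken as 2pq without any small-sample correction.

```python
def f_is(n_AA, n_Aa, n_aa):
    """F_IS = 1 - H_OBS / H_EXP for a biallelic locus in one population."""
    n = n_AA + n_Aa + n_aa
    p = (2 * n_AA + n_Aa) / (2 * n)   # frequency of allele A
    h_obs = n_Aa / n                  # observed heterozygosity
    h_exp = 2 * p * (1 - p)           # expected under random mating
    if h_exp == 0:
        raise ValueError("locus is monomorphic; F_IS is undefined")
    return 1 - h_obs / h_exp

print(f_is(250, 500, 250))   # HWE proportions: no deficiency or excess
print(f_is(400, 200, 400))   # heterozygote deficiency: positive F_IS
print(f_is(0, 1000, 0))      # all heterozygotes: the lower limit of F_IS
```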

F_ST measures the effect of population subdivision, which is the reduction in heterozygosity in a subpopulation due to genetic drift. F_ST is the most inclusive measure of population substructure and is most useful for examining the overall genetic divergence among subpopulations. It is also called the coancestry coefficient (θ) [Weir & Cockerham, 1984] or 'Fixation index' and is defined as the correlation of gametes within subpopulations relative to gametes drawn at random from the entire population (Subpopulation within the Total population). It is calculated using the average subpopulation heterozygosity (H_S) and the total population expected heterozygosity (H_T): F_ST = (H_T - H_S) / H_T. F_ST is non-negative by definition; it ranges between 0 = panmixis (no subdivision, random mating occurring, no genetic divergence within the population) and 1 = complete isolation (extreme subdivision). F_ST values up to 0.05 indicate negligible genetic differentiation whereas values >0.25 mean very great genetic differentiation within the population analyzed. F_ST is usually calculated for different genes, then averaged across all loci and all populations. For human populations, the average value of F_ST for a large number of DNA polymorphisms is 0.139 (and 0.119 for non-DNA polymorphisms) (Cavalli-Sforza, 1994). Using the F_ST values, less differentiation is seen between human populations within continents than between continents. This is consistent with simple isolation by distance (the greater variability of populations from sub-Saharan Africa is the rule). F_ST can also be used to estimate gene flow as the number of migrants per generation under Wright's island model: Nm ≈ 0.25 (1 - F_ST) / F_ST. This highly versatile parameter is even used as a genetic distance measure between two populations instead of a fixation index among many populations (see Weir BS, Genetic Data Analysis II, 1996 (Genetic Data Analysis III, 2007) and Kalinowski, 2002).
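A small Python sketch shows how F_ST follows from the average subpopulation heterozygosity (H_S) and the total expected heterozygosity (H_T), and how the gene flow estimate is derived from it. The sketch assumes one biallelic locus, subpopulations of equal size, and the simple heterozygosity-based form F_ST = (H_T - H_S) / H_T rather than the Weir & Cockerham estimator; the function names are hypothetical.

```python
def f_st(subpop_freqs):
    """F_ST from the frequency of one allele in each subpopulation.

    Assumes a biallelic locus and equally sized subpopulations.
    """
    k = len(subpop_freqs)
    # Average expected heterozygosity within subpopulations
    h_s = sum(2 * p * (1 - p) for p in subpop_freqs) / k
    # Expected heterozygosity if the pooled population mated at random
    p_bar = sum(subpop_freqs) / k
    h_t = 2 * p_bar * (1 - p_bar)
    if h_t == 0:
        raise ValueError("locus is monomorphic in the total population")
    return (h_t - h_s) / h_t

def migrants_per_generation(fst):
    """Gene flow estimate 0.25 * (1 - F_ST) / F_ST (island model assumption)."""
    return 0.25 * (1 - fst) / fst

print(f_st([0.5, 0.5]))   # identical frequencies: no Wahlund effect
print(f_st([0.2, 0.8]))   # diverged subpopulations
print(f_st([0.0, 1.0]))   # fixed for different alleles
print(round(migrants_per_generation(0.139), 2))
```

The first case illustrates the statement in the Wahlund effect entry that equal allele frequencies give F_ST = 0; the last line applies the gene flow formula to the average human F_ST quoted above.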

F_IT is rarely used. It is the overall inbreeding coefficient (F) of an individual relative to the total population (Individual within the Total population).

See also a Lecture Note on Analysis of Molecular Variance (AMOVA), a method of estimating population differentiation directly from molecular data (and Genetic Epidemiology Glossary).

Detecting Selection Using DNA Polymorphism Data

Several methods have been designed to use DNA polymorphism data (sequences and allele frequencies) to obtain information on past selection events. Most commonly, the ratio of non-synonymous (replacement) to synonymous (silent) substitutions (d_N/d_S ratio; see below) is used as evidence for overdominant selection (balancing selection), of which one form is heterozygote advantage. Classic examples of this are the mammalian MHC system genes and compatibility systems in other organisms: the self-incompatibility system of plants, fungal mating types and invertebrate allorecognition systems. In all these genes, a very high number of alleles is also noted. This can be interpreted as an indicator of some form of balancing (diversifying) selection. In the case of a neutral polymorphism, one common allele and a few rare alleles are expected. The frequency distribution of alleles is also informative. A large number of alleles showing a relatively even distribution is against neutrality expectations and suggestive of diversifying selection.

Most tests detect selection by rejecting neutrality assumption (observed data deviate significantly from what is expected under neutrality). This deviation, however, may also be due to other factors such as changes in population size or genetic drift (see Tripathy & Reddy, 2007 (Table 1) for a review in the context of G6PD deficiency). The original neutrality test was Ewens-Watterson homozygosity test of neutrality (see Glossary) based on the comparison of homozygosity and predicted value calculated by Ewens's sampling formula, which uses the number of alleles and sample size. This test is not very powerful.

Other commonly used statistical tests of neutrality are Tajima's D, and Fu & Li's D, D*, F and F*. Tajima's test (Tajima, 1989) is based on the fact that under the neutral model the number of segregating (polymorphic) sites and the average number of pairwise nucleotide differences both provide estimates of the same parameter (theta). If the value of D is too large (positive or negative), the neutral 'null' hypothesis is rejected. Tajima's D is expected to be zero for neutral variation; it is negative when there is an excess of rare polymorphisms (as after positive selection or population expansion) and positive when there is an excess of intermediate-frequency variants, suggesting balancing selection (Oleksyk, 2010). DnaSP calculates D and its confidence limits (two-tailed test). Tajima did not base this test on the coalescent, but Fu and Li's tests (Fu & Li, 1993) are directly based on it. The test statistics D and F require data from intraspecific polymorphism and from an outgroup (a sequence from a related species), whereas D* and F* only require intraspecific data. DnaSP uses the critical values obtained by Fu & Li, 1993 to determine the statistical significance of the D, F, D* and F* test statistics. DnaSP can also compute the Fs test statistic (Fu, 1997). The results of this group of tests (Tajima's D and Fu & Li's tests), based on allelic variation and/or level of variability, may not clearly distinguish between selection and demographic alternatives (bottleneck, population subdivision), but this problem only applies to the analysis of a single locus: demographic changes affect all loci whereas selection is expected to be locus-specific, so the two are distinguishable if multiple loci are analyzed. Tests for multiple loci include the HKA test described by Hudson et al (1987). This test is based on the idea that in the absence of selection, the expected number of polymorphic (segregating) sites within species and the expected number of 'fixed' differences between species (divergence) are both proportional to the mutation rate, and the ratio of them should be the same for all loci. Variation in the ratio of divergence to polymorphism among loci suggests selection.
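For readers who want to see what Tajima's D actually compares, the following Python sketch computes it from a set of aligned sequences using the constants given by Tajima (1989). It is an illustration only: the function name and the toy alignments are hypothetical, the sequences are assumed to be equal in length and free of gaps, and no significance test is attempted (software such as DnaSP should be used for that).

```python
import math
from itertools import combinations

def tajimas_d(sequences):
    """Tajima's D for a list of aligned, equal-length sequences (n >= 4)."""
    n = len(sequences)
    if n < 4:
        raise ValueError("at least 4 sequences are needed")
    length = len(sequences[0])
    # S: number of segregating (polymorphic) sites
    s = sum(1 for i in range(length) if len({seq[i] for seq in sequences}) > 1)
    if s == 0:
        raise ValueError("no segregating sites; D is undefined")
    # pi: average number of pairwise nucleotide differences
    pairs = list(combinations(sequences, 2))
    pi = sum(sum(x != y for x, y in zip(a, b)) for a, b in pairs) / len(pairs)
    # Constants from Tajima (1989)
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n * n + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    # Difference between the two theta estimates, scaled by its standard deviation
    return (pi - s / a1) / math.sqrt(e1 * s + e2 * s * (s - 1))

# Toy alignments: only rare variants (singletons) vs. intermediate-frequency variants
rare = ["TAAAAA", "ATAAAA", "AATAAA", "AAAAAA", "AAAAAA", "AAAAAA"]
intermediate = ["TTTAAA", "TTTAAA", "TTTAAA", "AAAAAA", "AAAAAA", "AAAAAA"]
print(round(tajimas_d(rare), 3))          # negative
print(round(tajimas_d(intermediate), 3))  # positive
```

The sign of D reflects which estimate of theta is larger: singletons add to the count of segregating sites but contribute little to pairwise differences, so an excess of rare variants pushes D below zero.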

A different group of neutrality tests that are not sensitive to demographic changes includes the McDonald-Kreitman test (McDonald & Kreitman, 1991) and the d_N/d_S ratio test. The McDonald-Kreitman test compares the ratio of the number of coding region nonsynonymous to synonymous 'polymorphisms' within species with the ratio of the number of nonsynonymous to synonymous 'fixed' differences between species in a 2x2 table. The most direct method of showing the presence of positive selection is to compare the number of nonsynonymous (d_N) to the number of synonymous (d_S) substitutions in a locus. A high (>1) value of the d_N/d_S ratio suggests fixation of nonsynonymous mutations with a higher probability than neutral (synonymous) ones. Statistical properties of this test are given by Goldman & Yang, 1994 and by Muse & Gaut, 1994. The d_N/d_S ratio tests take into account transition/transversion rate bias and codon usage bias.
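The 2x2 comparison behind the McDonald-Kreitman test can be sketched as follows. The counts used here are purely illustrative, the function names are hypothetical, and the choice of a two-sided Fisher's exact test and of the neutrality index as a summary are assumptions made for this sketch (other test statistics are also used for this table).

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test P value for the table [[a, b], [c, d]]."""
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d

    def prob(x):
        # Hypergeometric probability of x in the top-left cell
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables as extreme as, or more extreme than, the observed one
    return sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))

def neutrality_index(d_n, d_s, p_n, p_s):
    """(Pn / Ps) / (Dn / Ds); values below 1 suggest an excess of fixed nonsynonymous changes."""
    return (p_n / p_s) / (d_n / d_s)

# Illustrative counts: fixed differences (Dn, Ds) and polymorphisms (Pn, Ps)
d_n, d_s, p_n, p_s = 7, 17, 2, 42
print(round(neutrality_index(d_n, d_s, p_n, p_s), 3))
print(fisher_exact_two_sided(d_n, d_s, p_n, p_s) < 0.05)
```

Under neutrality the two ratios should be similar; a significantly higher nonsynonymous to synonymous ratio among fixed differences than among polymorphisms is taken as evidence of adaptive fixation.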

For other tests and software to perform these statistics, see Statistical Tests of Neutrality (Lecture Note by P Beerli); Statistical Tests of Neutrality of Mutations against Excess of Recent Mutations (Rare Alleles); Statistical Tests of Neutrality of Mutations against an Excess of Old Mutations or a Reduction of Young Mutations; Estimation of theta; Innan & Tajima, 2002; Properties of statistical tests of neutrality for DNA polymorphism data, Simonsen, 1995. Review of statistical tests of selective neutrality on genomic data, Nielsen, 2001, Luikart, 2003, Harris & Meyer, 2006 and a Lecture Note by Gil McVean. See also a review by SP Otto (Detecting the Form of Selection from DNA Sequence Data. Trends Genet 2000).

Linkage disequilibrium (LD)

LD refers to the tendency for two 'alleles' to be present on the same chromosome (positive LD), or not to segregate together (negative LD). As a result, specific alleles at two different loci are found together more or less often than expected by chance. LD is the nonindependence, at a population level, of the alleles carried at different positions in the genome. In the absence of LD, the expected frequency of a two-locus haplotype can be calculated as the probability of the joint occurrence of two independent events, simply by multiplying the two gene (allele) frequencies.

The same situation may exist for more than two alleles. The magnitude of LD is expressed as the delta (D) value, which is the difference between the observed and the expected haplotype frequency. If there is no LD, D will be zero (or not significantly different from zero); if there is positive LD, it will be a positive value. It can also be negative if the two alleles tend not to occur together. The statistical significance of LD, which depends on the sample size, and the magnitude of LD are separate issues. The statistical significance is usually determined by Fisher's exact test, and the magnitude is determined by either the D value or alternative measures of LD. The magnitude can be normalized (for allele frequencies) to have the same range of values for any frequency.

Ideally, the haplotype frequencies should be calculated from family typing data. Obviously, this gives the most accurate results. In practice, however, when family data are not available, D and two-locus haplotype frequencies are calculated from a sample of the population data by constructing 2x2 contingency tables for each allele pair. A contingency table for this purpose contains the individual (observed) values cross-classified by levels in two different attributes. A common 2x2 table constructed in genetic studies is as follows:

                                     allele i
                          Present (+)   Absent (-)   Row totals
allele j   Present (+)    a (+/+)       b (+/-)      a+b
           Absent (-)     c (-/+)       d (-/-)      c+d
           Column totals  a+c           b+d          N = a+b+c+d

Counts for each combination of levels (presence or absence) of the two factors (alleles) are placed in each cell. The corresponding D_ij is estimated by the formula known as the Mattiuz formula (most commonly used in HLA studies):

D_ij = (d/N)^(1/2) - [((b+d)/N) × ((c+d)/N)]^(1/2) (originally described by Bodmer & Bodmer, 1970; see also Schipper, 1998)

The haplotype frequency (HF_ij) equals GF_i × GF_j + D_ij, where GF is the gene (actually allele) frequency (the proportion of the chromosomes carrying a particular allele). The haplotype frequency calculated with this formula from the population data compares reasonably well with the estimates obtained directly from counting haplotypes constructed from family segregation data. This method generates a reliable estimate of a haplotype frequency with the exception of very small haplotype frequencies. Also for other parts of the genome, it has been reported that there is little or no advantage to constructing haplotypes from family data rather than unrelated individuals. The major point is that when using population data, genotyping errors become an issue. When genotyping a large number of markers, an error rate of only 1% will produce a large number of inaccurate haplotypes. Genotyping errors are not the only possible sources of accuracy problems. Other factors include sample size, allele frequency distributions and departures from HWE.
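The Mattiuz formula and the haplotype frequency calculation can be sketched in Python using the cell labels of the 2x2 table (a, b, c, d). The function name is hypothetical. The gene frequencies are estimated here with the square-root formula GF = 1 - (proportion of individuals lacking the allele)^(1/2), which is an assumption of this sketch (it presumes HWE and presence/absence typing) rather than something prescribed in the text.

```python
import math

def mattiuz_ld(a, b, c, d):
    """Delta (D_ij) and two-locus haplotype frequency from a 2x2 table.

    a: both alleles present, b: j present / i absent,
    c: j absent / i present, d: both absent (counts of individuals).
    Returns (D_ij, GF_i, GF_j, HF_ij).
    """
    n = a + b + c + d
    absent_i = (b + d) / n   # proportion of individuals lacking allele i
    absent_j = (c + d) / n   # proportion of individuals lacking allele j
    d_ij = math.sqrt(d / n) - math.sqrt(absent_i * absent_j)
    # Gene frequencies by the square-root formula (assumes HWE)
    gf_i = 1 - math.sqrt(absent_i)
    gf_j = 1 - math.sqrt(absent_j)
    hf_ij = gf_i * gf_j + d_ij
    return d_ij, gf_i, gf_j, hf_ij

# Hypothetical table for 10,000 individuals
d_ij, gf_i, gf_j, hf_ij = mattiuz_ld(2100, 1500, 1500, 4900)
print(round(d_ij, 4), round(gf_i, 4), round(gf_j, 4), round(hf_ij, 4))
```

The example counts were constructed from a population in HWE with haplotype frequency 0.10 and allele frequencies of 0.20 at both loci, so the sketch should recover D = 0.06 and HF = 0.10.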

D is the most basic LD measure, but there are others. Because the value of D depends on allele frequencies, a normalization of D is needed. This is achieved by taking into account the allele frequencies: normalized delta value (D') = D_AB / D_max. D_max is the lesser of pA pb or pa pB if D is positive, or the lesser of pA pB or pa pb if D is negative. Because the sign is arbitrary, |D'| is often used rather than D'. Therefore, D' (normalized LD) is scaled to remove allele frequency effects. In a large enough sample, D' = 1 indicates complete LD and D' = 0 corresponds to no LD. |D'| is directly related to the recombination fraction, and its generalization to more than two loci is the only measure of LD not sensitive to allele frequencies. HAPLOVIEW and MIDAS are some of the software that calculate D' values.

Another statistic of linkage disequilibrium is the square of the correlation coefficient (r²; r-squared) between the alleles at locus A and B: r² = D² / (pA pa pB pb), which can also be expressed as r² = D² / (pA (1 - pA) pB (1 - pB)) (for two loci with two alleles each). The measure r² has several properties that make it more useful (Pritchard, 2001; Weiss, 2002; Carlson, 2004). In brief, for low allele frequencies r² has more reliable sample properties than |D'|. The allelic association metric ρ (rho) has the strongest population theory basis and is least sensitive to marker allele frequencies (Morton, 2001); it also has several statistically optimal properties such as consistency, asymptotic unbiasedness and asymptotic efficiency (Shete, 2003). This parameter can be estimated on MIDAS for SNP data (along with other measures of LD). (See CIGMR; GENESTAT LD Measures; CGIL Summer Course Notes and Measures of LD by Hedrick, 1987; Lewontin, 1988; Devlin & Risch, 1995; Devlin, 1996; Morton, 2001; Wall & Pritchard, 2003; Shete, 2003; Jorde, 2003; Mueller, 2004; Zaykin, 2004; Wang, 2005; GOLD-Disequilibrium Statistics and a Lecture Note at UCL for further details.)
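The three pairwise measures (D, D' and r²) can be computed together from one haplotype frequency and the two allele frequencies, as in this minimal Python sketch. The function name and the example frequencies are hypothetical, and the haplotype frequency is assumed to be already known (phased or estimated).

```python
def ld_measures(x_AB, p_A, p_B):
    """Return (D, D', r2) for two biallelic loci.

    x_AB: frequency of the A-B haplotype; p_A, p_B: allele frequencies.
    """
    p_a, p_b = 1 - p_A, 1 - p_B
    d = x_AB - p_A * p_B                 # observed minus expected
    if d >= 0:
        d_max = min(p_A * p_b, p_a * p_B)
    else:
        d_max = min(p_A * p_B, p_a * p_b)
    d_prime = d / d_max if d_max > 0 else 0.0
    r2 = d * d / (p_A * p_a * p_B * p_b)
    return d, d_prime, r2

print(ld_measures(0.50, 0.5, 0.5))   # complete LD, equal allele frequencies
print(ld_measures(0.25, 0.5, 0.5))   # linkage equilibrium
print(ld_measures(0.10, 0.1, 0.5))   # D' = 1 although r2 is small
```

The last example shows why the two normalized measures are not interchangeable: when allele frequencies at the two loci differ, D' can reach 1 while r² stays low.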

Statistical Methods for LD Estimation: While the Mattiuz formula can be used to calculate two-locus haplotype frequencies manually, an alternative method, maximum likelihood estimation (achieved by the EM 'expectation-maximization' algorithm), can be used if computing facilities are available. This method was originally described by Yasuda and Tsuji (1975), and compared with other methods by Schipper et al (1998) and Excoffier et al. (1995). It can also be used for multiple-locus haplotypes (Long, 1995). ARLEQUIN, one of the most sophisticated population genetic data analysis packages, as well as EMLD use the EM algorithm to calculate multilocus LD. Some of the other software to perform LD analysis are: MIDAS, HAPLOVIEW, Genetic Data Analysis (GDA), EH, MLD, DISEQ, SHEsis, GOLD and PopGene. The proprietary LD software Golden Helix (manual) computes multilocus haplotype probabilities using the composite haplotype method (CHM) and the Expectation Maximization (EM) algorithm for SNP data. LD analysis can also be performed online (Online LD Analysis; Genotype2LDBlock; VG2). All of the above (pairwise) LD measures can be estimated from unphased SNP data on STATA using David Clayton's program pwld within genassoc.pkg. For HLA data, HWE and LD (as well as the Ewens-Watterson test) can be assessed using PyPop.
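To make the idea of the EM algorithm concrete, here is a minimal Python sketch for the simplest case: two biallelic SNPs typed in unrelated individuals. Only double heterozygotes are ambiguous in phase, so the 'expectation' step splits them between the two possible haplotype pairs in proportion to the current haplotype frequencies (which is where HWE is assumed), and the 'maximization' step recounts haplotypes. The function name, the input layout and the starting values are assumptions of this sketch and do not reproduce the implementation of any package named above.

```python
def em_haplotype_freqs(g, max_iter=200, tol=1e-10):
    """EM estimate of two-locus haplotype frequencies from unphased genotypes.

    g[i][j] is the number of individuals carrying i copies of allele A
    at locus 1 and j copies of allele B at locus 2 (i, j in 0..2).
    Returns a dict with frequencies of haplotypes AB, Ab, aB and ab.
    """
    n = sum(sum(row) for row in g)
    # Haplotype counts from phase-unambiguous genotypes
    base = {
        "AB": 2 * g[2][2] + g[2][1] + g[1][2],
        "Ab": 2 * g[2][0] + g[2][1] + g[1][0],
        "aB": 2 * g[0][2] + g[1][2] + g[0][1],
        "ab": 2 * g[0][0] + g[1][0] + g[0][1],
    }
    n_dh = g[1][1]                      # double heterozygotes (phase unknown)
    f = {h: 0.25 for h in base}         # starting values
    for _ in range(max_iter):
        # E-step: expected share of double heterozygotes that are AB/ab
        cis, trans = f["AB"] * f["ab"], f["Ab"] * f["aB"]
        w = cis / (cis + trans) if cis + trans > 0 else 0.5
        # M-step: recount haplotypes with the expected phase assignments
        new = {
            "AB": (base["AB"] + w * n_dh) / (2 * n),
            "ab": (base["ab"] + w * n_dh) / (2 * n),
            "Ab": (base["Ab"] + (1 - w) * n_dh) / (2 * n),
            "aB": (base["aB"] + (1 - w) * n_dh) / (2 * n),
        }
        converged = max(abs(new[h] - f[h]) for h in f) < tol
        f = new
        if converged:
            break
    return f

# Hypothetical sample in which only AB and ab haplotypes exist (complete LD)
genotypes = [[25, 0, 0],
             [0, 50, 0],
             [0, 0, 25]]
print({h: round(x, 4) for h, x in em_haplotype_freqs(genotypes).items()})
```

With more than two loci or multiallelic markers the number of compatible haplotype pairs per individual grows quickly, which is why dedicated software is used in practice.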

Interpretation of LD Data

The patterns of LD observed in natural populations are the result of a complex interplay between genetic factors and the population's demographic history (Pritchard, 2001). LD is usually a function of distance between the two loci. This is mainly because recombination acts to break down LD in successive generations (Hill, 1966). When a mutation first occurs, it is in complete LD with the nearest marker (D' = 1.0). Given enough time and as a function of the distance between the mutation and the marker, LD tends to decay and reaches D' = 0 at complete equilibrium. Thus, it decreases at every generation of random mating unless some process is opposing the approach to linkage 'equilibrium'. However, physical distance could account for less than 50% of the observed variation in LD. One genetic phenomenon that affects LD is gene conversion (see Glossary). Gene conversion is an important mechanism in the breaking down of allelic associations over short distances, i.e., decay of LD. Other factors that influence LD include changes in population demographics (such as population growth, bottlenecks, geographical subdivision, admixture and migration) and selective forces. Admixture (intermixture of populations) would cause LD if the mixing populations have different allele frequencies. LD will also be erased faster in large populations than in small ones. Permanent LD may result from natural selection if some gametic combinations confer higher fitness than other combinations. An extraordinary example of the effect of recombination rates on LD is the discrepancy between genetic distance and physical distance between HFE and HLA-A, which generates strong LD despite the 5 Mb distance (Malfroy, 1997). Regional LD may also vary according to haplotype. An example has been presented for HLA haplotypes. Haplotype-specific patterns of LD (Ahmad, 2003) may reflect haplotype-specific recombination hotspots, as has been shown for the mouse MHC.

Note that LD has nothing to do with HWE and should not be confused with it (see Possible Misunderstandings in Genetics).

Genetic distance (GD)

Genetic distance is a measure of the genetic relatedness of samples of populations (whereas genetic diversity represents diversity within a population). The estimate is based on the number of allelic substitutions per locus that have occurred during the separate evolution of two populations. (See lecture notes on Genetic Distances, Estimating Genetic Distance; and GeneDist: Online Calculator of Genetic Distance.) The software Arlequin, PHYLIP, GDA, PopGene and R are suitable for calculating population-to-population genetic distances from allele frequencies (Microsat is a microsatellite distance program).

Genetic distance can be computed with the freeware PHYLIP (or using its R wrapper Rphylip). One component of the package, GENDIST, estimates genetic distance from allele frequencies using one of three methods: Nei's, Cavalli-Sforza's or Reynolds' (see papers by Cavalli-Sforza & Edwards, 1967, Nei, 1983, Nei, 1996 and lecture notes (1) and (2) for more information on these methods). GENDIST can be run online using default options (Nei's genetic distance) to obtain genetic distance matrix data. The PHYLIP program CONTML estimates phylogenies from gene frequency data by maximum likelihood under a model in which all divergence is due to genetic drift in the absence of new mutations (Cavalli-Sforza's method) and draws a tree. If new mutations are contributing to allele frequency changes, Nei's method should be selected in GENDIST to estimate genetic distances first. A tree can then be obtained using other components of PHYLIP. NEIGHBOR draws a cladogram using the genetic distance matrix data (from GENDIST). It uses either Saitou and Nei's (1987) "Neighbor Joining (NJ) Method" or the UPGMA (unweighted pair group method with arithmetic mean; average linkage clustering) method (Sneath & Sokal, 1973). Neighbor Joining is a distance matrix method producing an unrooted tree without the assumption of a clock (the evolutionary rate does not have to be the same in all lineages). The major assumption of UPGMA is an equal rate of evolution along all branches (which is frequently unrealistic). Other components of PHYLIP that draw phylogenetic trees from genetic distance matrix data are FITCH (Fitch-Margoliash method with no assumption of equal evolutionary rate) and KITSCH (employs Fitch-Margoliash and Least Squares methods with the assumption that all tip species are contemporaneous and that there is an evolutionary clock; in effect, a molecular clock). Another freeware program, PopGene, calculates Nei's genetic distance and creates a cladogram from genotypes using the UPGMA method.
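To show what a distance matrix of this kind contains, the sketch below computes Nei's standard genetic distance, D = -ln(J_XY / sqrt(J_X J_Y)), directly from allele frequencies. It is an illustration only and is not PHYLIP's code. The data layout (one list of allele frequencies per locus, with alleles in the same order in every population), the population names and the frequencies are assumptions, and no sample-size bias correction is applied.

```python
import math


def nei_standard_distance(pop_x, pop_y):
    """Nei's standard genetic distance between two populations.

    pop_x, pop_y: lists of loci; each locus is a list of allele frequencies
    given in the same allele order for both populations (assumed layout).
    """
    # Sums over loci and alleles; the 1/L averaging cancels in the ratio.
    j_xy = sum(x * y for lx, ly in zip(pop_x, pop_y) for x, y in zip(lx, ly))
    j_x = sum(x * x for locus in pop_x for x in locus)
    j_y = sum(y * y for locus in pop_y for y in locus)
    # Undefined (math domain error) if the populations share no alleles.
    return -math.log(j_xy / math.sqrt(j_x * j_y))


def distance_matrix(pops):
    """Pairwise distances as a dict keyed by (name_a, name_b)."""
    names = sorted(pops)
    return {(a, b): nei_standard_distance(pops[a], pops[b])
            for a in names for b in names}


# Hypothetical allele frequencies at two loci in three populations.
populations = {
    "PopA": [[0.50, 0.50], [0.90, 0.10]],
    "PopB": [[0.45, 0.55], [0.85, 0.15]],
    "PopC": [[0.10, 0.90], [0.20, 0.80]],
}
for pair, dist in distance_matrix(populations).items():
    print(pair, round(dist, 4))
```

A matrix of this kind is the input that tree-building methods such as NJ or UPGMA operate on.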

Because of the different assumptions they use, NJ and UPGMA methods may construct cladograms with totally different topologies. For an example of this and a review of the main differences between the two methods, see Nei & Roychoudhury, 1993 (PDF). Both methods use distance matrices (Fitch-Margoliash and Minimum Evolution methods are also distance methods). The principal difference between NJ and UPGMA is that NJ does not assume an equal evolutionary rate for each lineage. Since the constant rate of evolution does not hold for human populations, NJ seems to be the better method. For genetic loci subject to natural selection, the evolutionary rate is not the same for each population and therefore UPGMA should be avoided for the analysis of such loci (including the HLA genes). The leading group in HLA-based genetic distance analysis, led by Arnaiz-Villena, proposes that the most appropriate genetic distance measure for the HLA system is the DA value first described by Nei, 1983. Unlike UPGMA, NJ produces an unrooted tree. To find the root of the tree, one can add an outgroup. The point in the tree where the edge to the outgroup joins is the best possible estimate for the root position. One problem with tree construction might be the lack of statistical assessment of the phylogenetic tree presented. This is best done with bootstrap analysis, originally described in Felsenstein J: Evolution 1985;39:783-91 and Efron, 1996, and reviewed in Nei, 1996. For a discussion of statistical tests of molecular phylogenies, see Li & Gouy, 1990 and Nei, 1996. For the topology to be statistically significant, the bootstrap value for each cluster should ideally reach 70% (at least 50%, which may overestimate the accuracy of the tree). Bootstrap tests should be done with at least 1000 (preferably more) replications.
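The logic of a bootstrap test over loci can be sketched as follows. This is a deliberately simplified illustration and not the procedure implemented in PHYLIP or DISPAN. Loci are resampled with replacement, Nei's DA distance, DA = 1 - (1/L) x (sum over loci and alleles of sqrt(x y)), is recomputed for each replicate, and the support for a pair of populations is taken as the proportion of replicates in which that pair is the closest one. With three populations this is the pair that UPGMA would join first. With more populations a full tree would have to be rebuilt in every replicate. All names and frequencies are hypothetical.

```python
import math
import random
from itertools import combinations


def nei_da(pop_x, pop_y):
    """Nei's DA distance from per-locus allele frequency lists."""
    total = sum(math.sqrt(x * y)
                for lx, ly in zip(pop_x, pop_y) for x, y in zip(lx, ly))
    return 1.0 - total / len(pop_x)


def closest_pair(pops):
    """Return the (sorted) pair of population names with the smallest DA."""
    return min(combinations(sorted(pops), 2),
               key=lambda pair: nei_da(pops[pair[0]], pops[pair[1]]))


def bootstrap_support(pops, pair, n_rep=1000, seed=0):
    """Proportion of locus-resampled replicates in which `pair` is closest."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n_loci = len(next(iter(pops.values())))
    hits = 0
    for _ in range(n_rep):
        # Resample locus indices with replacement, same loci for every population.
        idx = [rng.randrange(n_loci) for _ in range(n_loci)]
        resampled = {name: [loci[i] for i in idx] for name, loci in pops.items()}
        if closest_pair(resampled) == pair:
            hits += 1
    return hits / n_rep


# Hypothetical frequencies at four biallelic loci in three populations.
populations = {
    "PopA": [[0.5, 0.5], [0.9, 0.1], [0.3, 0.7], [0.6, 0.4]],
    "PopB": [[0.4, 0.6], [0.8, 0.2], [0.5, 0.5], [0.2, 0.8]],
    "PopC": [[0.2, 0.8], [0.7, 0.3], [0.4, 0.6], [0.3, 0.7]],
}
print(bootstrap_support(populations, ("PopA", "PopB"), n_rep=1000))
```

With only a handful of loci, as in this toy data set, the support values are unstable. This is consistent with the point that many loci are needed before a topology can be trusted.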

Nei noted that some genes are more suitable than others for phylogenetic inference and that most tree-building methods tend to produce the same topology whether the topology is correct or not (Nei M, 1996). He also added that adding one more species or population would sometimes change the whole tree for unknown reasons. An example of this has been provided in a study of human populations with genetic distances (Nei & Roychoudhury, 1993). The properties of the most popular genetic distance measures have been reviewed (Kalinowski, 2002). Whichever is used, large sample sizes are required when populations are relatively genetically similar, and loci with more alleles produce better estimates of genetic distance. In a simulation study, Nei et al concluded that more than 30 loci should be used for making phylogenetic trees (Nei, 1983). There seems to be a consensus that estimated trees are nearly always erroneous (i.e., the topological arrangement will be wrong) if the number of loci is less than 30 (Nei, 1996; Jorde LB. Human genetic distance studies. Ann Rev Anthropol 1985;14:343-73). If populations are closely related, as many as 100 loci may be necessary for an accurate estimation of the relationships by genetic distance methods. Cavalli-Sforza et al have noted strong correlations between genetic relationships and linguistic evolutionary trees, with exceptions for New Guinea, Australia and South America (Cavalli-Sforza, 1994).

Especially for the HLA genes, phylogenetic trees can be constructed by using Nei's DA genetic distance values and the NJ method with bootstrap tests in DISPAN. Correspondence analysis, a supplementary analysis to genetic distances and dendrograms, displays a global view of the relationships among populations (Greenacre, 1984; Greenacre & Blasius, 1994; Blasius & Greenacre, 1998; Tekaia, 2016). This type of analysis tends to give results similar to those of dendrograms, as expected from theory (Cavalli-Sforza & Piazza, 1975), and is more informative and accurate than dendrograms, especially when there is considerable genetic exchange between close geographic neighbors (Cavalli-Sforza, 1994). In their enormous effort to work out the genetic relationships among human populations, Cavalli-Sforza et al concluded that two-dimensional scatter plots obtained by correspondence analysis frequently resemble geographic maps of the populations with some distortions (Cavalli-Sforza, 1994). Correspondence analysis can be performed on the same allele frequencies that are used in phylogenetic tree construction, using ViSta or VST, but most conveniently the Multi Variate Statistical Package MVSP (link to a Tutorial on Correspondence Analysis).