GENETIC EPIDEMIOLOGY
Mehmet Tevfik DORAK
Genetic Epidemiology PowerPoint Presentation (PPT)
Bioinformatics for Genetic Epidemiologists Presentation (PPT) & Bioinformatics Tools
Genome Biology for Genetic Epidemiologists
Classical epidemiology deals with disease patterns and factors associated with causation of diseases with the ultimate aim of preventing the disease. Molecular epidemiologic studies measure exposure to specific substances (DNA adducts) and early biological response (somatic mutations), evaluate host characteristics (genotype and phenotype) mediating response to external agents, and use markers of a specific effect (like gene expression) to refine disease categories (such as heterogeneity, etiology and prognosis). Genetic epidemiology overlaps with molecular epidemiology. It is the epidemiological evaluation of the role of inherited causes of disease in families and in populations; it aims to detect the inheritance pattern of a particular disease, localize the gene and find a marker associated with disease susceptibility. Gene-gene and gene-environment interactions should also be studied in genetic epidemiology of a disease. The most widely accepted definition of genetic epidemiology is by Morton: ‘a science which deals with the etiology, distribution, and control of disease in groups of relatives and with inherited causes of disease in populations’ (Morton NE, 1982).
Genetic epidemiology was born in the 1960s as the merger of population/statistical/mathematical genetics and classical/molecular epidemiology. The pioneers include Newton Morton, Douglas Falconer, Robert C Elston, Elizabeth A Thompson and Neil Risch.
Genetic epidemiologic research follows these steps:
1. Establishing that there is a genetic component to the disorder.
2. Establishing the relative size of that genetic effect in relation to other sources of variation in disease risk (environmental effects such as intrauterine environment, physical and chemical effects as well as behavioral and social aspects).
3. Identifying the gene(s) responsible for the genetic component.
All of these can be achieved either in family studies (segregation, linkage, association) or in population studies (association).
General methods employed in genetic epidemiology:
* Genetic risk studies: What is the contribution of genetics as opposed to environment to the trait? Requires family-based, twin/adoption or migrant studies.
* Segregation analyses: What does the genetic component look like (oligogenic 'few genes each with a moderate effect', polygenic 'many genes each with a small effect', etc)? What is the model of transmission of the genetic trait? Segregation analysis requires multigeneration family trees, preferably with more than one affected member.
* Linkage studies: What is the location of the disease gene(s)? Linkage studies screen the whole genome and use either parametric methods or nonparametric methods, such as allele-sharing methods (e.g. the affected sibling-pair method), which make no assumptions about the mode of inheritance, penetrance or disease allele frequency (the parameters). The underlying principle of linkage studies is the cosegregation of two genes (one of which is the disease locus).
* Association studies: What is the allele associated with the disease susceptibility? The principle is the coexistence of the same marker on the same chromosome in affected individuals (due to linkage disequilibrium). Association studies may be family-based (transmission/disequilibrium test - TDT; also called transmission distortion test) or population-based. Alleles, haplotypes or evolutionary-based haplotype groups may be used in association studies (Clark, 2004; Tzeng, 2005). Most recently, genome-wide association studies (GWAS) have become the norm for the most robust results (Clark, 2005; Wang, 2005; Pearson, 2008; McCarthy, 2008; Psychiatric GWAS Consortium Coordinating Committee, 2009; the WTCCC GWAS (PDF); a list of recent GWAS in OEGE; NHGRI (NIH): Catalog of GWAS; GWAS Central; GWAS on HuGENet) (see also the Glossary).
The samples needed for these studies may be nuclear families (index case and parents), affected relative pairs (sibs, cousins, any two members of the family), extended pedigrees, twins (monozygotic and dizygotic) or unrelated population samples.
Genetic epidemiologic approach
When the question is whether a disease has a genetic component, the detection and estimation of familial aggregation (e.g., higher occurrence rates in siblings or offspring) is the first step in the approach. This may already be known from descriptive epidemiology studies. Results of observational studies on siblings, parent-offspring concordance, twins, adoptees and even migrants may suggest a genetic component in the etiology of a disease or trait. Familial aggregation of a trait is a necessary but not sufficient condition to infer the importance of genetic susceptibility, because environmental and cultural influences can also aggregate in families, leading to family clustering and excess familial risk. Similar environment may be the reason for familial aggregation. With rising divorce rates, study of recurrence risk in half-siblings is another powerful method to test for parent-specific events. In an application of this method, multiple sclerosis appeared to have a genetic basis transmitted more from mothers than fathers (Ebers, 2004).
Familial aggregation for a disease is measured by the relative recurrence risk (RRR) or familial risk ratio (FRR). These quantities are denoted by λR, where R denotes a relationship (S = sib, O = offspring, DZ = dizygotic twin, C = cousin, etc.), and their values are the risks of relatives of type R of affected individuals being themselves affected, divided by the population prevalence. Examination of relative recurrence risk values for various classes of relatives can potentially suggest a polygenic background and epistasis (Risch, 1990a; 2001). The estimation of λS is prone to ascertainment bias and should be performed with great care (Chakraborty, 1987; Guo, 1997). Stratification of risk by degree of relatedness (e.g. in siblings versus cousins) and comparisons with the risk to spouses living in the same household can help distinguish between genetic and non-genetic contributions to familial effects. A higher concordance rate for brother pairs than for father-son pairs suggests an X-linked recessive genetic background (as has been observed in X-linked prostate cancer (Monroe, 1995)). Same-sex pairs of affected relatives suggest X-linked recessive (males) or X-linked dominant (females; due to intrauterine loss of males) inheritance. The genes within pseudoautosomal regions (PAR) of the X and Y chromosomes have a unique segregation pattern in which the disease can occur in either sex but affected sibs tend to be of the same sex. The mutant allele would consistently segregate, during male meiosis, with sexual phenotype. A man could possess the mutant allele on either his X or Y chromosome. If it is on the X chromosome, then only his daughters would inherit the allele, whereas if it resided on the Y chromosome, then only his sons would inherit the allele (Crow, 1994). Unequal frequencies of sex-discordant and sex-concordant affected sib pairs may be used as the basis of a test for linkage to the pseudoautosomal region (Horwitz & Wiernik, 1999).
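As a minimal sketch of the recurrence risk ratio described above (the risk to relatives of type R of affected individuals, divided by the population prevalence), the following Python function computes the ratio from simple counts. The function name, argument names and the example numbers are hypothetical and chosen only for illustration; the sketch ignores the ascertainment corrections that a real analysis would need.

```python
def recurrence_risk_ratio(affected_relatives, total_relatives, population_prevalence):
    """Risk in relatives of type R of affected individuals / population prevalence.

    Illustrative only: no correction for ascertainment bias is applied.
    """
    if total_relatives <= 0:
        raise ValueError("total_relatives must be positive")
    if not 0 < population_prevalence <= 1:
        raise ValueError("population_prevalence must be in (0, 1]")
    risk_in_relatives = affected_relatives / total_relatives
    return risk_in_relatives / population_prevalence


# Hypothetical numbers: 30 affected among 1000 siblings of cases,
# and a population prevalence of 0.2%.
lambda_sib = recurrence_risk_ratio(30, 1000, 0.002)
print(f"sibling recurrence risk ratio: {lambda_sib:.1f}")
```

Computing the same ratio for several classes of relatives (siblings, offspring, cousins) gives the pattern of decline with degree of relatedness that the text refers to.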
Other traditional designs for distinguishing non-genetic shared family effects from genetic effects have been studies of twins and adoptees. Twin studies provide somewhat more specific information than recurrence risk ratios (see Australian Twin Study; Swedish Twin Registry; Netherlands Twin Register and St Thomas Hospital Twin Research & Genetic Epidemiology websites).
Twin studies have traditionally been used to estimate the genetic contribution to a trait through the comparison of monozygotic (MZ) pairs (who share all their genes) with dizygotic (DZ) pairs (who share on average half of their genes). The greater similarity of MZ twins than DZ twins is considered evidence of genetic factors. A standard measure of similarity used in twin studies is the concordance rate. This can be pairwise (Pr) or probandwise (Cc) (McGue, 1992). The pairwise concordance is a descriptive statistic and simply gives the proportion of affected pairs that are concordant for the disease. It is calculated as the proportion of twin pairs with both twins affected among all ascertained twin pairs with at least one affected: Pr = C/(C+D), where C is the number of concordant pairs and D is the number of discordant pairs. The probandwise concordance is the proportion of affected individuals among the co-twins of previously ascertained index cases. It allows for double counting of doubly ascertained twin pairs and has the advantage of being interpretable as the recurrence risk in a co-twin of an affected individual. The following formula is used in estimation: Cc = 2C/(2C+D). The probandwise concordance rate Cc is the preferred of the two measures.
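The two concordance formulas above, Pr = C/(C+D) and Cc = 2C/(2C+D), can be sketched in a few lines of Python. The function names and the example counts are hypothetical; the probandwise formula as written assumes that both members of every concordant pair were independently ascertained.

```python
def pairwise_concordance(concordant_pairs, discordant_pairs):
    """Pr = C / (C + D): proportion of affected pairs that are concordant."""
    total = concordant_pairs + discordant_pairs
    if total == 0:
        raise ValueError("at least one ascertained pair is required")
    return concordant_pairs / total


def probandwise_concordance(concordant_pairs, discordant_pairs):
    """Cc = 2C / (2C + D): concordant pairs count twice, once per proband."""
    total = 2 * concordant_pairs + discordant_pairs
    if total == 0:
        raise ValueError("at least one ascertained pair is required")
    return 2 * concordant_pairs / total


# Hypothetical counts: 20 concordant and 60 discordant twin pairs.
C, D = 20, 60
print("pairwise:", pairwise_concordance(C, D))
print("probandwise:", probandwise_concordance(C, D))
```

With the same counts the probandwise rate is never lower than the pairwise rate, because each concordant pair contributes two probands.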
In theory, complete genetic determination of a disease would equate to MZ twins having 100% concordance and DZ twins having 50% concordance (MacGregor, 2000). Naturally, this finding should be considered in relation to other evidence and always with a critical assessment of the assumptions behind the twin approach. Among the most central assumptions is that the two types of twins share to an equal extent the environmental experiences that are relevant for the development of the trait. For behavioral and psychiatric conditions, this assumption of equal environments has been reported to hold (Kendler, 1993), but it was challenged by Guo, who showed that even in the complete absence of any genetic factor and bias, the greater environmental similarity of MZ twins alone can result in a higher concordance rate in MZ twins than in DZ twins (Guo, 2001). MZ twin concordance rates of less than 100% emphasize the importance of environmental factors. In lung cancer, concordance rates are very similar in MZ and DZ twins, suggesting the role of shared environment (smoking) rather than genetics. For examples of twin studies on environmental and genetic causes of cancer, see Lichtenstein, 2000 and Risch, 2001. Because twins may share intrauterine circulation, 100% concordance rates in twins for leukemia (which may initiate prenatally) do not necessarily mean that leukemia is strongly genetically determined but suggest transfer of a leukemia clone to the other twin.
The usual assumptions of a classic twin study are random mating, no interactions between genes and environment, and equivalent environments for MZ and DZ twins. In the study of a complex trait, phenotypic variance is divided into a component due to inherited genetic factors (heritability), a component due to environmental factors common to both members of the pair of twins (the shared environmental component), and a component due to environmental factors unique to each twin (the nonshared environmental component). Twin studies have been used to establish the presence of a genetic component in the etiology in many diseases (MacGregor, 2000) including celiac disease (Greco, 2002), Alzheimer disease (Raiha, 1996), schizophrenia (Sullivan, 2003) and cancer (Risch, 2001).
Alternatively, twins discordant for disease have been used to examine possible environmental causes. Adoption studies also permit the separation of childhood rearing effects from genetic effects by studying the similarity of adopted children with their biological and foster parents. The assumptions are that the resemblance between an adopted child and biological parent is due only to genetic effects, while that between the adopted child and the adoptive parent is only environmental in origin (see Magnusson, 1999 for an example in cervical cancer; and the Colorado Adoption Project website for adoptee studies). However, sometimes the representativeness of adoption studies can be questioned due to special circumstances surrounding adoption (adoption bias). In contrast, twins are born into all classes of society (Hublin & Kaprio, 2003). Migration studies also provide clues for genetic vs environmental causes (Parkin & Khlat, 1996; see also Ecological Studies). If the incidence in migrants shifts toward the host population's incidence, this suggests stronger environmental factors in pathogenesis (as in diabetes (Drash, 1990), hypertension (He, 1991) and cancer (McCredie, 1998; Sasco, 2003)).
Once a genetic basis is established, the next step is to define the mode of inheritance of the trait/disease or to map the specific gene(s) contributing to the trait. The former is achieved by segregation analysis in families. Recurrence risk ratios within families allow an informal evaluation of the possible segregation modes. However, it is possible to use maximum likelihood techniques to test hypotheses representing different sources of genetic influence. For instance, is there a single major gene or are there many genes of small effect that influence the trait? Could there be two major genes that are interacting to cause variation in the trait? Segregation analysis can be used to answer these questions (see a Lecture Note by SA Monks at Washington University). Segregation analysis is most useful in single gene disorders due to a biallelic gene (Elston, 1981). When multiple loci with multiple alleles are involved as in most complex diseases, it becomes less powerful.
Segregation analysis estimates the genetic model of a trait by looking at multigenerational family data. The components of a genetic model are (1) transmission probabilities (the probability that a parental genotype transmits a particular allele to an offspring); (2) penetrance for each genotype; and (3) allele frequencies in the population (to determine prior probabilities of genotypes when inferring genotypes from phenotypes) (Thompson, 1986; Olson, 1999). In model-free methods, the frequencies and penetrance of disease genotypes need not be known in advance (Goldgar, 2001).
Segregation analysis is a prerequisite for linkage analyses. Segregation analysis reveals Mendelian inheritance patterns (autosomal or sex-linked and recessive or dominant); nonclassical inheritance (mitochondrial diseases, genomic imprinting, parent of origin effect, genetic anticipation etc); or non-Mendelian inheritance (no pattern) (see Clinical Genetics for details). Factors interfering with genotype-phenotype correlation such as incomplete penetrance, variable expressivity, confounding by other genes (allelic or locus heterogeneity, multigene inheritance, epistasis, modifier genes, sex influence, parental effect) or environmental factors, and nonclassic genetic phenomena (imprinting - parent of origin effect, mitochondrial inheritance) complicate the segregation analysis of complex diseases for which no inheritance pattern is obvious despite familial aggregation. Complex segregation analyses are based on more elaborate mathematical methods of genetic transmission and liability (Morton, 1971; Elston, 1981; Lalouel, 1983 (PDF); Elston, 1992; Jarvik, 1998). Families with large pedigrees and many affected individuals are particularly informative both for establishing that genes matter and for identifying specific genes (Terwilliger & Goring, 2000).
Recurrence risk ratios for relative pairs, twin concordance rates, and heritability coefficients are all functions not only of genetic effects and gene frequency, but also of environmental effects, the distribution of environmental factors in the population, and gene-environment interactions (for a review of GEI, see North & Martin, 2008). Thus, establishing the genetic effects on disease occurrence should not rely purely on these kinds of family-based measures (Guo, 2000a).
The genes that make up the genetic component of a disease etiology can be localized by linkage (cosegregation); subsequent association studies can then identify the disease gene and its allele contributing to disease risk. If the unknown locus underlying a phenotype and the marker being studied are linked, then pairs of relatives (usually sib pairs but can be uncle-nephew, grandparent-grandchild, half-sib or first-cousin pairs) concordant for the phenotype will tend to be similar with respect to the marker genotype (Risch, 1990b). For larger recombination fraction (θ) values, grandparent-grandchild pairs are best; for small relative recurrence risk values, sibs are best. Although intuitively it sounds feasible, the use of affected-unaffected pairs is generally a poor strategy (Risch, 1990b). In the absence of linkage, there is no reason why the phenotypic similarity of relatives should correlate with their similarity for the marker genotype. Linkage analysis tests for such a correlation between phenotype and genetic marker similarity.
Linkage studies aim to obtain a crude chromosomal location of the gene or genes associated with a phenotype of interest, e.g. a genetic disease or an important quantitative trait. Linkage strategies include traditional ones (linkage analysis on pedigrees; allele-sharing methods: candidate genes, genome screen; animal models: identifying candidate genes) and newer ones (focus on special populations (Wright, 1999; Peltonen, 2000; Arcos-Burgos, 2002) such as Finland, Iceland, Newfoundland, Sardinia, Amish and Hutterites, haplotype-sharing; congenic/consomic lines in mice). For a general review, see genetic linkage in Kimball’s Biology.
The recombination fraction (denoted θ, theta) between a known genetic locus (marker) and an unknown disease locus (gene) lies at the heart of genetic linkage analysis (tutorial by F Clerget-Darpoux). If the two loci are far apart, segregation of one locus will be independent of the other (cosegregation and no cosegregation are equally likely). At θ = 1/2 (0.50), four types of gametes (two recombinant and two nonrecombinant, a total of four from a pair of homologous chromosomes) are equally likely to be produced. This value (θ = 1/2) is the baseline in linkage studies: θ is the proportion of gametes a person transmits in which a recombination event has occurred between the two loci, and when two loci are far apart on the same chromosome or, in the extreme case, on different chromosomes, they segregate independently according to the law of independent assortment, so that θ equals 50% (Forabosco, 2005; and see tutorial by F Clerget-Darpoux). Linked genes are, however, transmitted to the same gamete more than 50% of the time, resulting in less than 50% of gametes being different from parental chromosomes. Thus, when 0 ≤ θ < 1/2, the 'parental-type' gametes are more frequent than the 'recombinant-type' gametes.
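To make the relationship between the recombination fraction and gamete types concrete, here is a small Python sketch for a double heterozygote with known phase (haplotypes written here as AB and ab, a labeling assumed for illustration). Each of the two parental gametes is produced with probability (1 - θ)/2 and each of the two recombinant gametes with probability θ/2. The function name is hypothetical.

```python
def gamete_frequencies(theta):
    """Expected gamete frequencies from an AB/ab double heterozygote.

    theta is the recombination fraction between the two loci (0 to 0.5).
    """
    if not 0.0 <= theta <= 0.5:
        raise ValueError("theta must lie between 0 and 0.5")
    parental = (1.0 - theta) / 2.0   # each of AB and ab
    recombinant = theta / 2.0        # each of Ab and aB
    return {"AB": parental, "ab": parental, "Ab": recombinant, "aB": recombinant}


for theta in (0.0, 0.1, 0.5):
    print(theta, gamete_frequencies(theta))
```

At θ = 0.5 all four gamete types are equally likely, which is the baseline of no linkage; for smaller θ the parental types predominate.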
Two approaches to genetic linkage (and association) analysis have evolved for traits showing Mendelian and non-Mendelian segregation patterns:
(1) those that require prior specification of a genetic model (mode of inheritance) for the trait under study (model-based / parametric methods),
(2) those that do not assume a specific trait inheritance (model-free / nonparametric methods) (Risch, 1996; Elston, 1998; Olson, 1999; Goldgar, 2001). [Some model-free methods may be parametric.]
Model-based linkage analysis is based on a likelihood ratio, the logarithm of which is called a lod score. This is not the logarithm of the odds for linkage but the logarithm of the likelihood ratio for a particular value of the recombination fraction vs. free recombination (θ = 0.50; θ = 0 for two genes that are completely linked and 0.50 for unlinked genes) (Risch, 1992; Elston, 1998; Borecki, 2001). In model-based linkage analysis, all aspects of the statistical model other than the recombination fraction (allele frequencies and penetrance) are assumed to be known. Then, the likelihood at θ, for 0 ≤ θ < 1/2, is divided by the likelihood at θ = 1/2 to yield a likelihood ratio. The logarithm to base 10 of this likelihood ratio is the lod score. Lod scores can be obtained for a range of θ values. The θ that maximizes the lod score, provided the maximum exceeds +3, is taken as evidence for linkage at that recombination fraction between the marker and the disease locus (Elston, 2000). Traditionally, a lod score > +3 is considered to be significant (and a lod score below -2 may be used as an exclusion criterion). This corresponds to a P value of 10^-4 (one-sided). A number of software packages are available to analyze linkage in pedigree data; the most commonly used are Linkage, Genehunter, Mendel, Merlin and Allegro. For a complete list, see Genetic Analysis Software List.
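The lod score can be illustrated with the simplest possible case, which is an assumption made here and not the general pedigree likelihood: N fully informative, phase-known meioses of which R are recombinant. The likelihood is then θ^R (1 - θ)^(N - R), and the lod score is the base-10 logarithm of this likelihood divided by its value at θ = 1/2. The function names below are hypothetical; real analyses use the pedigree likelihoods implemented in the packages named above.

```python
import math


def lod_score(theta, n_meioses, n_recombinants):
    """log10 of L(theta) / L(0.5) for phase-known, fully informative meioses."""
    if not 0.0 <= theta <= 0.5:
        raise ValueError("theta must lie between 0 and 0.5")
    if not 0 <= n_recombinants <= n_meioses:
        raise ValueError("n_recombinants must be between 0 and n_meioses")

    def log_term(count, prob):
        # Convention: 0 * log(0) = 0, so an unobserved class contributes nothing.
        if count == 0:
            return 0.0
        if prob == 0.0:
            return float("-inf")
        return count * math.log10(prob)

    n_nonrecombinants = n_meioses - n_recombinants
    log_l_theta = log_term(n_recombinants, theta) + log_term(n_nonrecombinants, 1.0 - theta)
    log_l_null = n_meioses * math.log10(0.5)
    return log_l_theta - log_l_null


def max_lod(n_meioses, n_recombinants):
    """Maximum-likelihood theta (capped at 0.5) and the lod score at that value."""
    if n_meioses <= 0:
        raise ValueError("n_meioses must be positive")
    theta_hat = min(n_recombinants / n_meioses, 0.5)
    return theta_hat, lod_score(theta_hat, n_meioses, n_recombinants)


# Hypothetical data: 10 informative meioses with 1 recombinant.
theta_hat, lod = max_lod(10, 1)
print(f"theta_hat = {theta_hat:.2f}, maximum lod = {lod:.2f}")
```

Under these assumptions, ten meioses with no recombinants give a lod score of 10 × log10(2), which is just above the conventional threshold of +3.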
The simplest (and nonparametric) linkage analysis is to test whether the proportion of alleles the sibs share at a marker locus is greater than 1/2 (expected sharing in sibs). This 'test of cosegregation' is called the 'mean' test. In absolute linkage, 100% of concordant sibs will share the marker as opposed to 50% (see Gulcher, 2001 for a review). In the past, when the disease gene was not known and could not be analyzed directly, a linked genomic marker (usually a polymorphic gene) was used for prenatal risk estimation in a given family. The assumption would be that whichever linked allele the affected member of the family has, the fetus would have the same if carrying the disease gene. Examples include HLA-A linkage with hereditary haemochromatosis (usually HLA-A3) and HLA-B linkage with congenital adrenal hyperplasia (usually HLA-B47). Microsatellite loci in absolute linkage with then unknown disease genes were used in prenatal diagnosis of single-gene diseases in the past (eg, congenital adrenal hyperplasia, Wiskott-Aldrich syndrome etc). The methodology (maximum lod score, MLS) for affected-sib-pair linkage analysis was first described by Risch (1990) and reviewed by Holmans (1998).
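The 'mean' test mentioned above can be sketched as follows, assuming that identity-by-descent (IBD) sharing can be scored unambiguously for each affected sib pair as 0, 1 or 2 alleles. Under no linkage, the proportion shared per pair has mean 1/2 and variance 1/8 (sharing 0, 1 or 2 alleles with probabilities 1/4, 1/2, 1/4), so a one-sided normal approximation can be used. The function name and the example counts are hypothetical.

```python
import math
from statistics import NormalDist


def mean_test(n_share0, n_share1, n_share2):
    """One-sided 'mean' test of excess IBD allele sharing in affected sib pairs.

    Returns (mean proportion of alleles shared, z statistic, one-sided p value).
    """
    n_pairs = n_share0 + n_share1 + n_share2
    if n_pairs == 0:
        raise ValueError("at least one sib pair is required")
    mean_sharing = (n_share1 + 2 * n_share2) / (2 * n_pairs)
    # Null: per-pair proportion shared has mean 1/2 and variance 1/8.
    se_null = math.sqrt(1.0 / (8.0 * n_pairs))
    z = (mean_sharing - 0.5) / se_null
    p_one_sided = 1.0 - NormalDist().cdf(z)
    return mean_sharing, z, p_one_sided


# Hypothetical counts of affected sib pairs sharing 0, 1 and 2 alleles IBD.
mean_sharing, z, p = mean_test(10, 50, 40)
print(f"mean sharing = {mean_sharing:.2f}, z = {z:.2f}, one-sided p = {p:.2g}")
```

The test is one-sided because only excess sharing among affected pairs is evidence for linkage.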
Linkage and association studies are occasionally mixed up. They aim to address different questions and provide different answers. Linkage is a phenomenon of cosegregating loci, not alleles, within families. Linkage studies are used for coarse mapping as they have a limited genetic resolution of about 1 cM. If two markers are close, there will not be much recombination between them and they will cosegregate. This leads to finding linkage in a pedigree-based analysis. Association studies at the population level are the next step for fine mapping. Association may result from direct involvement of the gene or linkage disequilibrium (LD) with the disease gene at the population level. Linkage always leads to an association but this is usually intrafamilial with no association at the population level (linkage of genotype for a genetic marker to disease may be unique to the particular family). In other words, linkage does not necessarily mean a consistent association with a particular allele. Allelic association, on the other hand, may or may not be due to linkage (except when LD exists between the associated marker and the unknown disease gene, association is not due to linkage). Not all associations are due to a direct genetic mechanism, i.e. being close to a disease gene. This is an important point because the value of family-based TDT test depends on co-presence of linkage and association (Spielman, 1993; Thomson, 1995; Elston, 2000). If more than one allele or haplotype shows an association with a disease, the association might reflect linkage. Different HLA-AB haplotype associations in hereditary haemochromatosis, for example, reflected linkage of HLA-A and -B loci to the HFE gene.
While recombination fraction is what linkage studies rely on, linkage disequilibrium (LD) is the foundation of association studies. The assumption is that the genetic marker studied is close enough to the actual disease gene and this will result in an allelic association at the population level (Jorde, 2000; Weiss, 2002; Carlson, 2004; Morton, 2005). Another critical assumption of both association and LD mapping is that there is little allelic heterogeneity within loci. The magnitude of LD is affected by many factors but if everything else is assumed to be equal, the most important factor is the physical/genetic distance between the disease and marker alleles: the closer they are the lower the recombination frequency and the stronger the magnitude of LD. This implies that close linkage between the marker and disease loci would result in longer periods of LD within the population. For LD to be detectable, linkage need not be present; allelic or gametic association is a better term to describe the general phenomenon of LD. See Basic Population Genetics for more on LD.
Association studies focus on population frequencies, whereas linkage studies focus on concordant inheritance. One may be able to detect linkage without association when there are many independent trait-causing chromosomes in a population (i.e., no LD of the disease causing allele to a specific marker nearby); or association without linkage when an allele explains only a minor proportion of the variance for a trait, so that the allele may occur more often in affected individuals but does a poor job of predicting disease status within a pedigree (Lander & Schork, 1994). Association is usually with a 'susceptibility' locus, which increases the probability of contracting the disease but is not 'necessary' or 'sufficient' for disease expression. In this case, the marker will not show linkage in families. If an association is, however, with a marker in LD with a 'necessary' locus for disease development, then there will be evidence for linkage in family data (Greenberg, 1993; Greenberg & Doneshka, 1996). Linkage analysis is not useful for finding loci that are neither necessary nor sufficient for disease expression (so-called susceptibility loci).
Association studies have several practical advantages over linkage studies. As opposed to linkage studies, families with multiple affected individuals are not required and no assumptions are made about the mode of inheritance of the disease. In addition, association studies have considerable statistical power to detect genes of weak effects unlike linkage studies in families (Risch, 1996; Morton, 1998; Risch, 2000). Most significant factors independently associated with increased success in linkage studies are (a) an increase in the number of individuals studied and (b) study of a sample drawn from only one ethnic group (Altmuller, 2001). For association studies, large datasets, small P values and independent replication of results are important for reproducible results (Editorial, Nat Genet 1999; Dahlman, 2002). Use of ancestral haplotype groups in association studies (evolutionary-based association study design) is another way to increase power (Templeton, 1987; 1995; 2000; Schork, 1998; Seltman, 2003; Fejerman, 2004; Tzeng, 2005).
Successful examples of the use of linkage and association studies to locate and find disease susceptibility genes in complex diseases include rheumatoid arthritis (PTPN22 gene; R620W variant) (Gregersen, 2005) and Crohn disease (Ogura, 2001; Hugot, 2001; Todd, 2005).
Family-based vs population-based (case-control or prospective cohort) association studies
Family-based association studies (Thomson, 1995; Gauderman, 1999) include:
* Genotype Haplotype Relative Risk (GHRR) Method (Terwilliger & Ott, 1992)
* Haplotype Relative Risk (HRR) Method (Falk & Rubinstein, 1987; Knapp, 1993)
* Affected Family-Based Controls (AFBAC) Approach (Thomson, 1995)
* Transmission Disequilibrium/Distortion Test (TDT) (Spielman, 1993 & 1994; Ewens & Spielman, 1995; Clayton & Jones, 1999). See also sib-TDT (S-TDT) and extended-TDT (E-TDT).
TDT is robust to population stratification and can be performed in families used for linkage analysis. It is less powerful than case-control analyses because it requires 1/3 more individuals to observe the same effect (Risch & Teng, 1998; Long & Langley, 1999). TDT may be impractical for diseases with late onset, because parental genotypes are often unavailable, but TDT-derived approaches have been developed, such as sib-TDT, which uses unaffected siblings as controls for affected individuals. Where no unaffected sibling exists, parental chromosomes that have not been transmitted to the affected child can be used as pseudocontrols (HRR method). TDT-based studies also benefit from the use of the evolutionary-based haplotype analysis (ET-TDT) approach (Seltman, 2001). The real value of TDT is felt when population stratification cannot be controlled (Thomas, 2002; Wacholder, 2002). Because cases and controls are basically the same individuals (or their chromosomes), they originate from the same ethnicity. The TDT design tests the joint null hypothesis that there is no linkage and no allelic association (Spielman, 1993; Thomson, 1995; Elston, 2000). A significant result, chance findings aside, is due to the presence of both linkage and association. Like population stratification in population-based association studies, meiotic drive or transmission ratio distortion can cause spurious associations in TDT (see also FBAT).
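For a biallelic marker, the TDT is commonly computed as a McNemar-type statistic on transmissions from heterozygous parents to affected offspring: if b parents transmit the candidate allele and c transmit the other allele, the statistic is (b - c)^2 / (b + c), referred to a chi-square distribution with 1 degree of freedom. The formula is not given in the text above and is stated here as the standard form; the function name and example counts are hypothetical.

```python
import math


def tdt(transmitted, not_transmitted):
    """TDT statistic and p value from heterozygous-parent transmission counts.

    transmitted:      heterozygous parents transmitting the candidate allele (b)
    not_transmitted:  heterozygous parents transmitting the other allele (c)
    """
    total = transmitted + not_transmitted
    if total == 0:
        raise ValueError("at least one informative transmission is required")
    chi_square = (transmitted - not_transmitted) ** 2 / total
    # Upper tail of a chi-square with 1 df: P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi_square / 2.0))
    return chi_square, p_value


# Hypothetical counts: the allele is transmitted 60 times and not transmitted 40 times.
chi_square, p_value = tdt(60, 40)
print(f"TDT chi-square = {chi_square:.2f}, p = {p_value:.4f}")
```

Only heterozygous parents are counted because homozygous parents are uninformative about which allele was transmitted.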
Possible ascertainment problems in case-control studies in genetic epidemiology
* The sample should be representative of all cases. Inclusion of those identified at a hospital clinic may or may not be appropriate. They should be unrelated, incident (as opposed to prevalent) and consecutively diagnosed ones. If the prevalence of the disease is known, this would give an idea for the completeness of ascertainment (for a rare disease).
* If the disease requires medical attention only in some cases, recruitment from a hospital will be selective (usually for severe cases).
* If there is a survival effect of the disease (as in Alzheimer disease and ApoE), and if the associated allele also modifies the risk of death from competing causes, the age-dependent frequencies will be different. In this case, age-matching of cases and controls becomes particularly important.
* The controls should be comparable to cases except for having the disease. Local, contemporary controls should be selected via the same routes as the cases. For relevant diseases, age- and sex-matching (by frequency or one-to-one) may be important. The controls that have been self-selected like volunteer marrow or blood donors are not ideal controls. A more acceptable control group is a truly population-based one.
The pitfalls of conventional epidemiologic studies equally apply to molecular epidemiological research: selection bias, information bias and confounding (Vineis & McMichael, 1998; Campbell, 2002; Boffetta, 2003) with additional problems unique to genetic studies (Olson, 2000; Cordell, 2000; Elbaz & Alperovitch, 2002; Lee & Ho, 2003; Morimoto, 2003; Potter, 2003). In genetic association studies, missing data may be distributed differentially between cases and controls and may generate spurious associations (Clayton, 2005). This bias may be due to having subsets of DNA samples extracted using different chemistries that influence the performance of the assay differentially. See Pitfalls in Genetic Association Studies & Bias in Studies of the Human Genome by TA Pearson (PPT).
One potential problem is that estimates of genetic effect are subject to confounding when cases and controls differ in their ethnic backgrounds (population stratification bias or confounding by ethnicity). This can occur when both disease risk and genetic mutation frequencies vary among ethnic groups (Thomas, 2002; Wacholder, 2002; Cardon, 2003). To avoid the problem of population stratification bias, matching cases to controls on ethnic background, stratification, family-based association studies or genomic controls (Devlin, 1999; Pritchard, 1999) can be used.
For more details on genetic association studies, see Statistical Analysis of Genetic Association Studies.
Genetic epidemiology of complex diseases
The term complex trait/disease refers to any phenotype that does not exhibit classic Mendelian inheritance attributable to a single gene, although it may exhibit familial tendencies (familial clustering, concordance among relatives). The contrast between Mendelian diseases and complex diseases involves more than just a clear or unclear mode of inheritance. In Mendelian diseases, the risk to relatives decreases by a factor of 1/2 with each degree of relationship (from first to second to third degree) but in complex diseases the risk decreases more rapidly (Risch, 1990a). Other hallmarks of complex diseases include known or suspected environmental risk factors; seasonal, birth order, and cohort effects; late or variable age of onset; and variable disease progression. Many complex diseases are hard to diagnose accurately; even quantitative traits such as hypertension often involve sizable measurement errors (Guo, 2000b). Ultimate analysis of complex traits requires sophisticated statistical designs incorporating all genetic and nongenetic variables, their interactions, and familial correlations. In general, linkage is harder to show in a complex disease than in a Mendelian disorder (Risch, 1992). A complex disease can be modeled in two different ways: (1) an additive model, which closely approximates genetic heterogeneity and is characterized by no interlocus interaction, and (2) a multiplicative model, representing epistasis (interaction) among loci (Risch, 1990a). It should be recognized that simple genetic traits are also complex when examined closely (Estivill, 1996) (see also Rannala, 2001 for a comprehensive review of complex disease genetics and the Journal of Clinical Investigation: Reviews on Complex Genetic Disorders (2005)).
Common susceptibility alleles in rare complex diseases?
One popular hypothesis proposes that the genetic factors underlying common diseases will be alleles that are themselves quite common in the population at large (Lander, 1996; Chakravarti, 1999 (PDF); Pritchard, 2001 (PDF)). When several different loci contribute to a phenotype (such as a complex disease), it is likely that the alleles at loci responsible for such interactions have high frequencies in populations (Carlson, 2004). If, for example, six genes contribute equally and independently to a disease with an incidence of 1.5%, and a susceptibility allele is needed at each of them, each susceptibility allele must have a population frequency of around 50% (0.5^6 ≈ 1.6%). Thus, modest-risk gene variants involved in polygenic diseases are often likely to be normal alleles from unsuspected loci that have relatively high frequencies (Reich, 2001). The identification of normal polymorphisms is of great importance for medical genetics (Cavalli-Sforza, 1998). However, rare coding region alleles are commonly deleterious and their contribution to the development of complex diseases is also evident (Kryukov et al, 2007; see also Ropers, 2007).
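The arithmetic behind the six-gene example can be checked with a short sketch. It rests on a simplifying assumption that is not spelled out in the text: disease occurs only when a susceptibility allele is present at every one of the loci, and the loci are independent, so the incidence is the product of the per-locus frequencies. The function names are hypothetical.

```python
def incidence_from_frequency(frequency, n_loci):
    """Incidence if a susceptibility allele is needed at each of n independent loci."""
    return frequency ** n_loci


def required_frequency(incidence, n_loci):
    """Per-locus frequency that yields the given incidence under the same assumption."""
    return incidence ** (1.0 / n_loci)


print(f"Incidence with frequency 0.5 at 6 loci: {incidence_from_frequency(0.5, 6):.4f}")
print(f"Frequency needed for 1.5% incidence at 6 loci: {required_frequency(0.015, 6):.3f}")
```

Under this assumption, rarer alleles would make the disease far less common than observed: a frequency of 10% at each of six loci would give an incidence of one in a million.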
Genetic Models
* Single major locus: dominant, recessive
* Multifactorial / polygenic (Falconer, 1965 & 1967), in which dominant, recessive, additive or multiplicative genetic models may be possible for each locus. The dominance model is different from these and refers to association with the heterozygous genotype (which does not get stronger in homozygotes; see also heterozygote advantage in Glossary).
* Mixed model: a single major locus with a polygenic background (Morton & McLean, 1974; Elston & Rao, 1978). For more on genetic models, see Introductory Statistical Genetics (PPT); Lewis, 2002.
The polygenic model has its origins in Fisher's work, which concluded that 'many small, equal and additive loci' would result in a Gaussian distribution for a phenotype (Fisher RA. Transactions of the Royal Society of Edinburgh 1918;52:399-433). Falconer was the first to introduce the idea of a normally distributed, quantitative trait as the 'liability' for a genetically determined disorder (Falconer, 1965). When environmental factors are known to influence the phenotype, the model is called multifactorial (polygenes and environment). The presence or absence of the phenotype is determined by a threshold, T (Falconer's multifactorial liability threshold model; multifactorial threshold model). This model applies to most complex diseases (like diabetes, hypertension, schizophrenia, cancer) where both multiple genes and environmental factors play a role in the development of the disease (for a review of GEI, see North & Martin, 2008). These disorders are presumed to result from additive effects of multiple genes with low penetrance. Individual mutations may not have any particular phenotype, but when they act in concert and in the presence of the necessary environmental conditions, they may produce a disease phenotype. Under a model of multiple interacting loci, no single locus could account for more than a five-fold increase in the risk of first-degree relatives. The disease shows increased incidence in families but with no recognizable inheritance pattern.
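A minimal sketch of the liability threshold idea follows. It assumes a purely polygenic liability built from 200 equal, additive alleles, each present with probability 0.5 (hypothetical numbers, with no environmental term), standardizes it, and places the threshold T so that a standard normal liability would give a prevalence of 5%. The simulated fraction above T should come out close to 5%, up to sampling noise and the discreteness of allele counts.

```python
import random
from statistics import NormalDist, mean, pstdev

STD_NORMAL = NormalDist()  # standard normal distribution assumed for liability


def threshold_for_prevalence(prevalence):
    """Threshold T such that P(liability > T) equals the prevalence."""
    return STD_NORMAL.inv_cdf(1.0 - prevalence)


def simulate_allele_counts(n_people, n_alleles, rng):
    """Risk-allele count per person; each allele is present with probability 0.5."""
    # getrandbits(n) draws n independent fair bits; counting the 1s gives
    # a Binomial(n, 0.5) count of risk alleles.
    return [bin(rng.getrandbits(n_alleles)).count("1") for _ in range(n_people)]


rng = random.Random(1)  # fixed seed so the run is reproducible
counts = simulate_allele_counts(20000, 200, rng)

# Standardize the allele counts so that liability has mean 0 and SD 1.
mu, sd = mean(counts), pstdev(counts)
liability = [(c - mu) / sd for c in counts]

T = threshold_for_prevalence(0.05)
affected_fraction = sum(z > T for z in liability) / len(liability)

print(f"Threshold T for 5% prevalence: {T:.3f}")
print(f"Simulated fraction above T: {affected_fraction:.3f}")
```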
The features of multifactorial inheritance are:
- The more severe the condition, the greater the risk to sibs,
- Carter effect: the sibs or offspring of a patient of the less commonly affected sex have higher susceptibility to the disease,
- The rarer the disease in the general population, the greater the relative increase in its frequency among relatives,
- If more than one individual in a family is affected, recurrence risk is higher,
- The risk falls rapidly as one passes from 1st to 2nd degree relatives.
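Several of the features listed above can be reproduced with a small numerical sketch of the liability threshold model. It assumes that the liabilities of a proband and a relative follow a bivariate standard normal distribution whose correlation equals heritability times the coefficient of relationship (0.5 for first-degree, 0.25 for second-degree relatives), that is, resemblance due to shared genes only. The function name, the heritability of 0.8 and the prevalences are hypothetical; severity and multiple affected relatives are not modeled.

```python
from math import sqrt
from statistics import NormalDist

PHI = NormalDist()  # standard normal liability


def recurrence_risk(prev_proband, prev_relative, heritability, relatedness, steps=4000):
    """P(relative affected | proband affected) under a bivariate-normal liability model.

    prev_proband / prev_relative: prevalence in the proband's and the relative's
    group (they differ when thresholds are sex-specific).
    """
    t_proband = PHI.inv_cdf(1.0 - prev_proband)
    t_relative = PHI.inv_cdf(1.0 - prev_relative)
    rho = heritability * relatedness  # liability correlation (shared genes only)
    # Trapezoidal integration over proband liabilities above the threshold.
    upper = t_proband + 8.0
    h = (upper - t_proband) / steps
    total = 0.0
    for i in range(steps + 1):
        x = t_proband + i * h
        weight = 0.5 if i in (0, steps) else 1.0
        # Relative's liability given proband liability x is normal with
        # mean rho * x and variance 1 - rho^2.
        p_relative = 1.0 - PHI.cdf((t_relative - rho * x) / sqrt(1.0 - rho * rho))
        total += weight * PHI.pdf(x) * p_relative
    return total * h / prev_proband


# Risk falls from first- to second-degree relatives (prevalence 1%).
first = recurrence_risk(0.01, 0.01, 0.8, 0.5)
second = recurrence_risk(0.01, 0.01, 0.8, 0.25)

# Relative risk to first-degree relatives is larger when the disease is rarer.
lambda_rare = recurrence_risk(0.001, 0.001, 0.8, 0.5) / 0.001
lambda_common = recurrence_risk(0.1, 0.1, 0.8, 0.5) / 0.1

# Carter effect: males affected more often (0.5%) than females (0.1%).
brother_of_affected_male = recurrence_risk(0.005, 0.005, 0.8, 0.5)
brother_of_affected_female = recurrence_risk(0.001, 0.005, 0.8, 0.5)

print(f"first degree {first:.4f}, second degree {second:.4f}, population 0.0100")
print(f"lambda (rare) {lambda_rare:.1f}, lambda (common) {lambda_common:.1f}")
print(f"brother of male case {brother_of_affected_male:.4f}, "
      f"brother of female case {brother_of_affected_female:.4f}")
```

With zero heritability the function returns the population prevalence, which is a convenient check that the integration behaves as intended.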
It was later appreciated that many diseases have at least two thresholds, differing by sex or corresponding to different degrees of severity (Reich, 1972). Examples include pyloric stenosis (sex dimorphism for liability) (Chakraborty, 1986) and orofacial cleft syndrome / cleft lip and palate (two thresholds for fetal mortality and disease) (Dronamraju, 1982 & 1983). The latter model proposes a lower threshold of liability resulting in cleft formation and a higher one causing fetal death (preferentially in males). The multifactorial threshold model is not very popular currently because it has been replaced by the concept of genetic heterogeneity.
Each model has its own assumptions and parameters. Multifactorial and mixed models are analyzed in the same way as statistical modeling of polytomy as a function of continuous variables. In both models, liability is assumed to be continuous (representing the sum of a large number of independent genetic and environmental factors) and normally distributed within the population. Another assumption is that all correlations between relatives are due to shared genes and not to shared environment. Multifactorial modeling may fail for a variety of reasons, including the presence of a major gene, invalidity of the assumptions of normality and shared genes, and the presence of dominance and epistatic interaction (multiplicative as opposed to additive effects). Multifactorial complex traits that have multiple genetic determinants in one population may show simpler inheritance patterns in another due to differences in allele distribution. The genetic basis of a multifactorial disease is that a genetically susceptible individual may or may not develop the disease depending on the interaction of a number of risk factors, both genetic and environmental (Risk Estimation for Multifactorial Diseases. ANN ICRP, 1999; Ottman, 1996; Cooper, 2003).
As with all model-fitting applications, genetic modeling is an imperfect science. For this reason, and also because of the increasing knowledge of the human genome, the trend is shifting towards skipping the initial steps of exploring genetic predisposition and proceeding directly to studies of specific candidate genes (Tabor, 2002) or genome-wide association studies (Clark, 2005).
Links
Online Encyclopedia of Genetic Epidemiology
International Genetic Epidemiology Society
Courses
Wellcome Trust Genome Campus Advanced Courses: Genetic Analysis of Multifactorial Diseases
Online Resources
Wellcome Trust Centre for Human Genetics: Course on the Design and Analysis of Disease-Marker Association Studies
Wellcome Trust Genome Sequence & Variation Course Manual (2003) (Other Course Manuals)
HuGE Navigator: Cancer Genome-wide Association and Meta Analyses Database / Variant Name Mapper / Genopedia / Phenopedia
Genetic Epidemiology / Bioinformatics / Statistics courses by Kristel Van Steen
EFG: Statistical Genetic Analysis of Complex Phenotypes Course Notes (2005) incl. Case-Control Association Studies by CM Lewis
Introduction to Genetic Linkage and Association Course Notes by Dave Curtis
Genetic Epidemiology SuperLecture by Kevin Kip
Introduction to Genetic Epidemiology Lecture by Hermine Maes
Human Molecular Genetics (Strachan & Read; 1999): Genetic Mapping of Complex Characters & Complex Disease Genetics
Centre for Integrated Genomic Medical Research (CIGMR): Statistical Genetic Analysis
GENESTAT: Genetic Association Studies Portal
Statistics for Genome-wide Association Studies by Laurent Briollais @ Bioinformatics.ca: PDF | PPT
CDC Genomics and Disease Prevention Center: Reports & Publications
Basic Molecular Genetics for Epidemiologists
Genetic Epidemiology Studies on Twins by Nick Martin
Quantitative Genetics Lecture Notes (slide presentation)
Human Genetics Interactive Learning Exercises
Genetic Calculation Applets by Knud Christensen (including heritability and variance components)
NIH National Human Genome Research Institute Programs (ENCODE; Genetic Variation)
SNP@Ethnos: a database of ethnically variant SNPs (Park, 2007)
Online Encyclopedia for Genetic Epidemiology Studies
Glossary on Genetic Epidemiology: Basic & Advanced
STROBE (STrengthening the Reporting of OBservational studies in Epidemiology) & Checklists
STREGA (STrengthening the REporting of Genetic Associations)
Suggested Reading
Books
* Armitage P & Colton T. Encyclopedia of Biostatistics. Volumes 1-8. John Wiley & Sons, 2005
* Bishop T & Sham P: Analysis of Multifactorial Diseases. Academic Press, 2000
* Clayton D & Hills M. Statistical Models in Epidemiology. OUP, 1993
* Costa LG & Eaton DL. Gene-Environment Interactions. Wiley Online, 2006
* Elston RC, Olson J, Palmer L (Eds) Biostatistical Genetics and Genetic Epidemiology. John Wiley, 2002
* Falconer DS & Mackay TF. Introduction to Quantitative Genetics. 4th Ed. Essex, UK: Longman, 1996
* Hedrick PW. Genetics of Populations. 3rd Ed. Boston, MA: Jones and Bartlett, 2004
* Khoury MJ, Beaty TH, Cohen BH (Eds) Fundamentals of Genetic Epidemiology. New York: Oxford University Press, 1993
* Khoury MJ. Fundamentals of Genetic Epidemiology (Book Chapter) 1997
* Lange K. Mathematical and Statistical Methods for Genetic Analysis. Springer-Verlag New York Inc, 2002
* Morton NE. Outline of Genetic Epidemiology. Basel: Karger, 1982
* Neale B, Ferreira M, Medland S, Posthuma D. Statistical Genetics: Gene Mapping Through Linkage and Association. Oxford: Taylor & Francis, 2007 (Amazon)
* Rao DC & Province MA. Genetic Dissection of Complex Traits (Advances in Genetics, Vol. 42). Academic Press, 2000
* Sham P. Statistics in Human Genetics. Hodder Arnold, 1997
* Thomas DC. Statistical Methods in Genetic Epidemiology. Oxford University Press, 2004
* Weir B. Genetic Data Analysis III. Sinauer Associates Inc, 2009
* Yang MC. Introduction to Statistical Methods in Modern Genetics. CRC, 2000
* Zeggini & Morris. Analysis of Complex Disease Association Studies. Elsevier, 2011 (e-book)
* Ziegler A & Koenig IR. A Statistical Approach to Genetic Epidemiology. Wiley, 2006
* Human Genome Epidemiology Online Book (CDC, 2010)
* Genetics and Public Health in the 21st Century. Oxford University Press, 2000
* Handbook of Statistical Genetics 2nd Edition, 2004; Online (Balding, Bishop, Cannings; John Wiley & Sons)
Articles
* Collected Papers of R.A. Fisher (including Statistical Methods in Genetics, Heredity 1952)
* Complex Disease Trait Mapping & Linkage Analysis Journal Watch
* Nature Reviews Genetics Web Focus: Statistical Analysis: Editorial, ( I ), ( II ), ( III ), ( IV )
* The Lancet Septet on Genetic Epidemiology: Editorial, ( I ), ( II ), ( III ), ( IV ), ( V ) , ( VI ), ( VII )
* Journal of Clinical Investigation Reviews on Complex Genetic Disorders (2005) incl. Gordon D & Finch SJ. Factors affecting statistical power in the detection of genetic association.
* Thompson EA. Genetic epidemiology: a review of the statistical basis. Stat Med 1986;5: 291-302
* Thompson EA. Likelihood and linkage: from Fisher to the future. Ann Stat 1996;24:449-65 (JSTOR-UK)
* Morton NE. Genetic epidemiology (review). Ann Hum Genet 1997;61:1-13
* Balding DJ: Tutorial on Statistical Analysis of Population Association Studies. Nat Rev Genet 2006;7:781-91
* Bochud M & Rousson V. Usefulness of Mendelian Randomization in Observational epidemiology. Int J Environ Res Public Health 2010;7:711–28
* Elston RC. Introduction and overview. Statistical methods in genetic epidemiology. Stat Methods Med Res 2000;9:527-41
* Elston RC. Segregation analysis. Adv Hum Genet 1981;11:63-120
* Elston RC. Linkage and association. Genet Epi 1998;15:565-76
* Kaprio J. Genetic Epidemiology. BMJ 2000; 320:1257-9
* Little J et al. Reporting, appraising, and integrating data on genotype prevalence and gene-disease associations. Am J Epidemiol 2002;156:300-310
* North KE & Martin LJ: The importance of gene environment interaction: implications for social scientists. Sociological Methods & Research 2008;37(2):164-200
* Rannala B. Finding genes influencing susceptibility to complex diseases in the post-genome era. AJPG 2001;1:203-221
* Risch N. Searching for genetic determinants in the new millennium. Nature 2000;405:847-56
* Schork NJ. Genetics of complex disease. Am J Respir Crit Care Med 1997;156:S103–9
* Sellers TA & Yates JR. Review of proteomics with applications to genetic epidemiology. Genet Epidemiol 2003;24:83-98
* Tabor HK, Risch NJ, Myers RM. Perspective on candidate-gene approaches for studying complex genetic traits. Nature Rev Genet 2002;3:1-7
* Carlson CS et al. Mapping complex disease loci in whole-genome association studies. Nature 2004;429:446-52
* Clayton D & McKeigue PM. Epidemiological methods for studying genes and environmental factors in complex diseases. Lancet 2001;358:1356-60
* Burton PR et al. Key concepts in genetic epidemiology. Lancet 2005;366:941-51
* Houlston RS & Peto J. The search for low-penetrance cancer susceptibility alleles. Oncogene 2004;23(38):6471-6
* Editorial: Freely Associating. Nat Genet 1999;22:1-2
* Ellsworth DL & Manolio TA. Importance of Genetics in Epidemiologic Research Part I - Part II - Part III. Ann Epidemiol 1999
* Whittemore AS & Nelson LM: Study Design in Genetic Epidemiology. Journal of the National Cancer Institute Monographs 1999: 26;61-9
* Yu W et al: Phenopedia and Genopedia. Bioinformatics 2010;26(1):145-6
* Advances in Genetics Vol. 42, 2001 [Rao DC & Province MA (Eds): Genetic Dissection of Complex Traits. ISBN 0-12-017642-4]
- Rao DC. Genetic dissection of complex traits: an overview. Advances in Genetics 2001;42:13-34
- Rice TK, Borecki IB. Familial resemblance and heritability. Advances in Genetics 2001;42:35-44
- Borecki IB, Suarez BK. Linkage and association: Basic concepts. Advances in Genetics 2001;42:45-68
- Province M. Linkage and association with structural relationships. Advances in Genetics 2001;42:183-190
- Schork NJ, Fallin D, Thiel B, Xu X, Broeckel U, Jacob HJ, Cohen D. The future of genetic case-control studies. Advances in Genetics 2001;42:191-212
- Goldgar DE. Major strengths and weaknesses of model-free methods. Advances in Genetics 2001;42:241-54
- Morton NE. Complex inheritance: the 21st century. Advances in Genetics 2001;42:535-43
* Int J Epidemiol Oct 2004 Issue: Special Theme - Genetic Epidemiology
* Annals of the ICRP 1999;29(3-4): Risk Estimation for Multifactorial Diseases
* Statistical Methods in Medical Research, Vol.9, No.6, 2000 (Genetic Epidemiology Issue)
* Phil. Trans. R. Soc. B: Genetic Variation and Human Health Issue (2005) (open access)
* JAMA: Genetics Issue (2008)
* Human Molecular Genetics: Complex Diseases (April 2004)
* Human Molecular Genetics: Genome-wide Association Studies Issue (October 2008)
* Human Heredity: Gene-Gene and Gene-Environment Interaction in Complex Trait Genome Wide Association (February 2007)
Journals
American Journal of Human Genetics
Annual Review of Genomics & Human Genetics
European Journal of Human Genetics
Frontiers in Applied Genetic Epidemiology
Twin Research & Human Genetics
Multimedia
Genetics for Epidemiologists: Short Course Presentations
Henry Stewart Talks: Genetic Epidemiology I & II
UAB Statistical Genetics Short Course Lectures on Video
Applied Genetic Epidemiology
Genetic Epidemiology of Cancer
Genetic Epidemiology of Prostate Cancer
Genetic Epidemiology of Infectious Diseases
Genetic Epidemiology of Coronary Heart Disease
Genetic Epidemiology of Coronary Atherosclerosis
Genetic Epidemiology of Essential Hypertension
Genetic Epidemiology of Type 1 Diabetes
Genetic Epidemiology of Inflammatory Bowel Disease
Genetic Epidemiology of Multiple Sclerosis (PubMed); see also Willer, 2003 & Ebers, 2004
Genetic Epidemiology of Psoriasis & Autoimmunity
Genetic Epidemiology of Osteoarthritis
Genetic Epidemiology of Orthopedic Conditions
Genetic Epidemiology of Alzheimer Disease
Genetic Epidemiology of Neurodegenerative Disease
Genetic Epidemiology of Myopia
Genetic Epidemiology of Glaucoma
Genetic Epidemiology of Bipolar Disorder
Genetic Epidemiology of Rheumatoid Arthritis
See also: The North American Rheumatoid Arthritis Consortium (NARAC)
Seven Examples of Applied Genetic Epidemiology of Complex Diseases in Human Molecular Genetics by Strachan & Read
Research at Melbourne University Centre for Genetic Epidemiology
Research at Erasmus MC Genetic Epidemiology Program (Netherlands)
MSc in Genetic Epidemiology (Sheffield, UK)
MSc in Bioinformatics (Oxford, UK)
MSc in Genetic Epidemiology and Bioinformatics (Cardiff, UK)
Software
Genetic Epidemiology Programs for Stata
Comprehensive List of Genetic Analysis Software
NCBI: Software for Genetic Linkage Analysis
Statistical Analysis in Genetic Epidemiology (S.A.G.E.) Software Package
Software written by Broad Institute Scientists (incl EIGENSTRAT, Haploview, Heritability Calculator, PLINK, SNAP)
Software written by Reich Laboratory (incl EIGENSOFT)
Software written by London Statistical Genetics Group
Software by UCLA Human Genetics Department (incl Mendel Suite, see documentation)
Computational Genetics Laboratory (Gene-gene Interaction and Epistasis Analysis Software) (NH, USA)
Statistical Genetics Programs by JH ZHAO (MRC Epidemiology Unit, Cambridge, UK)
Statistical Genetics Programs by D CURTIS (Barts, London, UK)
Statistical Genetics Programs by D CLAYTON (STATA) (Cambridge, UK)
Statistical Genetics Programs by F DUDBRIDGE (Cambridge, UK)
Statistical Genetics Programs by M STEPHENS (Washington University, USA)
Statistical Genetics Programs by J PRITCHARD (Chicago University, USA)
Statistical Genetics Programs by G ABECASIS (Ann Arbor, USA)
Statistical Genetics Programs by NJ COX (Chicago University, USA)
Statistical Genetics Programs by A THOMAS (Utah, USA)
Statistical Genetics Programs by S PURCELL (Boston, USA)
Statistical Genetics -SAS- Programs by M KNAPP (Bonn, Germany)
Statistical Genetics Programs by JR GONZALES (CREAL, Barcelona)
Statistical Genetics Programs by CHAN M-S (Oxford, UK)
Statistical Genetics Programs by J MARCHINI (Oxford, UK)
Mathematical Genetics Programs by G McVEAN (Oxford, UK)
Quantitative Genetic Epidemiology Programs (Fred Hutchinson Cancer Research Center, USA)
Statistical Genetics Research Group Software (Liege, Belgium)
Tools for Genome Mapping of Disease (Kwiatkowski Laboratory, Oxford, UK)
Wellcome Trust Centre for Human Genetics (Oxford, UK)
UAB SOPH Section on Statistical Genetics (Birmingham, AL)
UAB SOPH Section on Statistical Genetics: Association & TDT Programs (Birmingham, AL)
UAB SOPH Section on Statistical Genetics: Software for Genotyping Error Checking (Birmingham, AL)
Carnegie Mellon University Bioinformatics & Statistical Genetics Group Software (Pittsburgh, USA)
University of Pittsburgh Division of Statistical Genetics Software (Pittsburgh, USA)
University of Pittsburgh Medical Center Computational Genetics Laboratory (Pittsburgh, USA)
North Carolina State University Statistical Genetics Software (NC, USA)
University of Michigan Center for Statistical Genetics Software (MI, USA)
Computational Genetics Laboratory at the Norris-Cotton Cancer Center and Dartmouth Medical School (NH, USA)
University of Southampton Genetic Epidemiology Software (Southampton, U.K.)
Mayo Clinic Statistical and Genetic Epidemiology (Rochester, MN, USA)
Gene[VA] Tools for Genetic Data Analysis (AGP, Universite de Geneve, Switzerland)
Mathematical Genetics Software (Oxford, UK)
Linkage Disequilibrium Software (Oxford, UK)
Statistical Genetics and Marker Assessment Programs (Oxford, UK)
Genome-wide Association Study Software (Oxford, UK)
Edinburgh University Medical Genetics Software (Edinburgh, UK)
Pedigree Analysis for Genetics (PANGAEA) (Washington University, USA)
Variation Discovery Resource Software (UW-FHCRC)
Software and Databases at the Institute of Cancer Research (London, UK)
MRC Rosalind Franklin Centre for Genomic Research -RFCGR- Software Compilation
Software at Dyer Laboratory for Population Genetics (including AMOVA)
Software at Erasmus MC Genetic Epidemiology Program (Netherlands)
Genetic Calculation Applets by Knud Christensen (including heritability and variance components)
Online Encyclopedia of Genetic Epidemiology Studies Software
Genetic Analysis Manual Introduction at the Duke Center for Human Genetics
S.A.G.E. MERLIN MENDEL ARLEQUIN AMOVA LOTUS (manual; Nickolov & Milanov, 2007) MDR (tutorial; Hahn, 2003) UNPHASED (manual; ref) PHASE HAPSTAT CLUMP CLUMPHAP BLADE (Bayesian Haplotype LD Mapping) BLADE (ONLINE) SimWalk2 (alternative) GenTools SNPTagger TAG 'n' TELL WinBUGS GENECOUNTING (ONLINE) EHPLUS PHYLIP STRUCTURE & STRAT L-POP ADMIXMAP ANCESTRYMAP AFBAC GOLD HAPLOVIEW (tutorial) MIDAS SCORE (R) Genotype2LDBlock TFPGA GENEPOP GDA DISEQ SHEsis (ref) PyPop ONLINE HWE and ASSOCIATION TESTING (SNP) SNPator SNPLINK SNP Control SNPedia MedRefSNP PRESTO OEGE HWE CALCULATOR MBMDR TDT & S-TDT E-TDT ET-TDT FBAT DMLE-LD Mapping ALLASS LDMAP TRIMHAP SNPHAP GENEPI TOOLBOX
GWAS-related software: PLINK (manual; PLINK incorporated WHAP) & BEAGLE GWASpi (Ref) WGAviewer (Ge, 2008; Workshop Notes) EIGENSOFT (PCA) GenABEL (tutorial) SNPassoc (Ref) MAVEN (Ref) INTERSNP (Ref) IMPUTE2 SNPTEST GWIS (Ref) BiForce Toolbox (Ref; Manual; Quick Start Guide) KING (Kinship-based INference for Gwas; Ref; Manual)
GTOOL (Data Transformation Tool) PGDSpider (Data Conversion Tool; Manual)
GWAS databases: NHGRI-EBI GWAS Catalog; PhenoScanner; NCBI dbGAP (Association Results Browser); GWASdb; GWAS Central; Japan: GWAS Database
Address for bookmark: http://www.dorak.info/epi/genetepi.html
Last edited on 13 April 2018