Statistics in HLA Association Studies [M.Tevfik DORAK]

HLA AND DISEASE ASSOCIATION STUDIES

M.Tevfik Dorak

Genetic association studies: Design, analysis & interpretation by C Lewis (Brief Bioinform 2002)

Case–Control Genetic Association Analysis by W Li (Brief Bioinform 2008)

The Lancet Septet on Genetic Epidemiology: Editorial, ( I ), ( II ), ( III )*, ( IV ), ( V )**, ( VI ), ( VII )

* Genetic Association Studies (Cordell & Clayton, 2005)

** What Makes a Good Association Study (Hattersley & McCarthy, 2005)

Nature Reviews Genetics Web Focus: Statistical Analysis: Editorial, ( I )***, ( II ), ( III ), ( IV )

*** Tutorial on Statistical Analysis of Population Association Studies (DJ Balding, 2006) (PDF)

Population-Based Association Studies Lecture by David Clayton

'Reporting, Appraising, and Integrating Data on Genotype Prevalence and Gene-Disease Associations (2002)'

'Human Genome Epidemiology Online Book (2004)'

What is an association study?

Linkage and association studies are the two major types of investigations to determine the contribution of genes to disease susceptibility (or any other phenotype). While linkage studies can only use family data, association studies can be family or population based. In addition to having wider applications, association studies are considered to be more sensitive (having greater power) than linkage methods in a comparably sized study. Allelic association is useful to refine the location of major genes prior to positional cloning. This has been the main use of association studies: to map the unknown disease gene, which is presumed to be in LD with the associated gene, and to proceed to positional cloning of the disease gene (see Genetic Epidemiology). With the information obtained from the Human Genome Project, the focus of association studies is changing to examining the role played by recently discovered genes in disease development (Burke, 2003).

HLA genes have been known for a long time and the whole of the HLA complex has been sequenced (The MHC Sequencing Consortium, 1999; Allcock, 2002; Shiina, 2004; Horton, 2004), so why are we still doing HLA association studies? One of the many reasons is to identify disease-specific susceptibility (risk) and protective markers that can be used in immunogenetic profiling, risk assessment and therapeutic decisions. Perhaps a more important reason is to refine already known or constantly emerging associations in the light of expanded knowledge of the HLA genetic map. In the beginning, only 'classic' HLA genes and a few class III genes were known and could be included even in the most comprehensive association studies. The latest and perhaps ultimate map of the HLA complex contains more than 100 functional genes and the task is now to figure out the real association behind HLA associations. Several recent examples show that some HLA associations, however strong, are due to non-HLA genes. One has to remember that hemochromatosis and congenital adrenal hyperplasia were originally reported as associations with HLA antigens, not with HFE or CYP21A2. Therefore, from a basic science point of view, the ultimate goal of HLA association studies is to find out how genes cause the disease or modify susceptibility to it or its course. In the modern era of HLA association studies, the conceptualization, design, execution and interpretation have gained new dimensions. Hypothesis-driven studies are replacing fishing expeditions. The extreme polymorphism of HLA genes forces researchers to take statistical power into consideration in sample size determination more frequently, and to consider replication more than ever to avoid the well-recognized complications of multiple comparisons. Ever increasing global ethnic diversity, on the other hand, requires extraordinary care to prevent spurious associations resulting from population stratification. 
A large number of sophisticated methods have been developed by biostatisticians and statistical geneticists to handle the data generated from comprehensive association studies especially in diseases with complex genetic basis (Lewis, 2002; Forabosco, 2005; Cordell, 2005). This review is on the basic principles of traditional HLA association studies in terms of statistical interpretation with the ultimate aim of providing an insight to the modern methods of association data analysis.

General considerations and the multiple comparisons issue

The most common ways to make an inference for the population from which the analyzed sample has been drawn are significance probability calculation or confidence interval (CI) estimation. A CI contains the result of a significance test, but a significance test cannot provide the confidence limit. The statistical tests are essentially a test of whether the confidence limits include unity.

Each experiment starts with a null hypothesis which simply states that there is no difference between the two samples; that is, 'two samples have been drawn from the same population.' A test statistic generally relies on comparing the observed value with the distribution expected if the null hypothesis is true. If the null hypothesis is rejected by statistical testing, the alternative hypothesis (the effect is not zero; there is a difference, etc.) is accepted. The general form of a test statistic is as follows:

test statistic = (observed value - hypothesized value) / standard error of the observed value

The choice of the test used to compute this statistic depends on what is being compared. The usual tests used in HLA association studies and also in population genetic analysis of the HLA system are briefly described later. A test statistic yields the probability of observing the value we found (or an even more extreme value) when the null hypothesis is true. If the test gives a P value of <0.05, the observed value lies somewhere in the extreme 5% of the relevant distribution curve for the data, which implies an unusual finding.

In significance testing, a type I error (a false-positive finding) is considered to be more serious, and therefore more important to avoid, than a type II error (missing a positive finding). If 0.05 is the pre-determined level of statistical significance, this amounts to reaching one incorrect conclusion (false positive) with no biological relevance in twenty tests ¹. The value 0.05 is the probability of getting an erroneous result every time one compares two groups that are in fact equivalent. This level of error can be accepted in the context of a single test of a prespecified hypothesis. The probability of getting at least one statistically significant result as a result of type I error in multiple comparisons is, however, much higher than this. This probability can be calculated with the formula: 1 - (1 - α)ⁿ, where n is the number of comparisons and α is the arbitrarily chosen significance level which corresponds to the accepted failure rate for each comparison ¹⁻⁴. If the probability of getting the right result is 0.95 (P = 0.05) in a single test, it would be (0.95)²⁰ = 0.36 in 20 tests. Thus, the probability of getting it wrong is 1 - 0.36 = 0.64. This is to say that, for 20 comparisons and the significance level of 0.05, the probability of getting at least one erroneous result is 0.64 (which is 0.92 for 50 comparisons). This probability is smaller when the significance level is lower (0.18 for 20 comparisons and the significance level of 0.01). This argument applies to situations when the multiple comparison tests are independent tests.
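
The arithmetic in the paragraph above can be checked with a few lines of Python. This is only an illustrative sketch: the function name familywise_error is a label chosen here, not one used in the source, and the calculation assumes independent comparisons, as the text does.

```python
def familywise_error(alpha: float, n: int) -> float:
    """Probability of at least one false positive in n independent tests,
    each carried out at significance level alpha: 1 - (1 - alpha)^n."""
    return 1.0 - (1.0 - alpha) ** n


# The three scenarios quoted in the text.
for alpha, n in [(0.05, 20), (0.05, 50), (0.01, 20)]:
    print(f"alpha = {alpha}, n = {n}: {familywise_error(alpha, n):.2f}")
```

The three printed values correspond to the 0.64, 0.92 and 0.18 quoted above.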

To avoid a type I error, when independent multiple comparisons are carried out, and all genotypes examined have the same chance of being increased or decreased, a statistical safeguard should be applied. There is no definite limit for the number of comparisons that makes this necessary but it is a must if it is greater than 20. This is usually done by multiplying the P value by the number of comparisons (Bonferroni inequality method) ^3;5;6. This is basically lowering the statistical significance level and making the statistical test more conservative. For small numbers of comparisons (say up to five) its use is considered to be reasonable, but for larger numbers and for the multiple tests that are not independent (highly correlated), it is highly conservative ^1;6 (see also Bland & Altman, 1995). The precise correction is obtained by the formula:

P_corrected= 1 - (1-P)ⁿ

where P is the uncorrected P value, and n is the number of comparisons ⁷. The number to use for correction is not (as frequently done) the number of alleles or genotypes detected in the study but the number of comparisons one or more of which shows a significant result ⁸ (see below). For P values of less than 0.01, this formula gives almost the same result as simple multiplication ⁸. If the 'corrected' P value is still less than the pre-determined significance level (such as 0.05), then the result is significant.
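
As a rough illustration of how the two corrections compare, the sketch below applies both to the same uncorrected P value. The function names and the example values (P = 0.005, n = 10) are hypothetical, chosen only to show that the two results nearly coincide when P is small.

```python
def bonferroni(p: float, n: int) -> float:
    """Simple multiplication of P by the number of comparisons (capped at 1)."""
    return min(1.0, p * n)


def exact_correction(p: float, n: int) -> float:
    """P_corrected = 1 - (1 - P)^n, the formula given in the text."""
    return 1.0 - (1.0 - p) ** n


p, n = 0.005, 10  # hypothetical uncorrected P value and number of comparisons
print(f"multiplication: {bonferroni(p, n):.4f}")
print(f"1 - (1 - P)^n:  {exact_correction(p, n):.4f}")
```

The multiplication rule never gives a smaller value than the exact formula, which is why it is the more conservative of the two.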

There has always been a debate about what number to use in the multiplication. In contrast to the most common practice, this is not supposed to be just the number of alleles in the locus analyzed ⁹. If the allele frequencies are compared between patients and controls, the number of alleles is important as well as the number of comparisons in terms of age groups, sex groups, clinical subgroups, etc. When two antigens in linkage disequilibrium are investigated in an association study, the correction issue becomes slightly more complicated. Svejgaard & Ryder ⁸ have discussed this problem. When two loci without linkage disequilibrium between them are studied, Svejgaard recommends considering them separately in multiplication ⁹.

Another way to correct for the bias due to multiple comparisons is to do a second study to check the frequency of the same specificity in the same disease ^5;9;10. If the uncorrected P value is ≤ 0.05 as in the first study, then the second study has confirmed the results of the preliminary study. The second study is said to be done with a specific hypothesis, therefore, an expected association would not be regarded as an artifact of multiple comparisons.

It has been argued, however, that by using the multiplication rule to avoid type I errors, the risk for type II errors increases ^2;11. This means that some genuine finding may be ruled out as a chance finding where it is worth pursuing. This is a good point since in association studies in a disease with multifactorial etiology, there will rarely be a very significant association when a single individual factor is considered. Any significant association, regardless of the corrected P value, should be critically evaluated and pursued further if biologically plausible. As mentioned above, a second study may offer the best explanation to these kinds of marginal findings.

Murray has proposed another method for the design of the study, which would preserve the sensitivity while avoiding a high risk of type I error ¹². This method may be most useful for HLA association studies. It consists of setting down a priori a hierarchy of comparisons of interest. There should be a single comparison of primary interest, upon which the sample size (power) calculation would be based, and the analysis of which would be taken at face value. Thus the sensitivity is preserved for the most important question, since one has to achieve P < 0.05 in order to claim significance. There would then be a limited number of prespecified secondary comparisons, which would carry a lesser weight. But if one of these is significant after the correction for multiple comparisons, it cannot be too lightly dismissed. After that, any other comparisons of interest would be made in a purely exploratory fashion, in an attempt to generate hypotheses for future studies. Even the most extreme P values emerging in these final comparisons should be given very little weight until there is independent supporting evidence from other studies (a second study based on a specific hypothesis).

In their discussion of the statistical analysis of retrospective studies ¹³, Mantel and Haenszel state that "if the purpose of the retrospective study is to uncover leads for fuller investigation, it becomes clear there is no real multiple significance testing problem. A single retrospective study does not yield conclusions, only leads." This situation corresponds to a full HLA locus analysis in a disease with no a priori hypothesis for the nature of an association. This effort would then be called a hypothesis generation approach. Mantel and Haenszel go on to say "also, the problem does not exist when several retrospective and other type studies are at hand, since the inferences will be based on a collation of evidence, the degree of agreement and reproducibility among studies, and their consistency with other types of available evidence, and not on the findings of a single study." In the HLA field, this is the case when there are several independent studies on the same association in the same disease or when there is a second study performed to test a specific hypothesis.

A specific example for the application of Murray's proposition in the light of Mantel and Haenszel's view to the HLA field would be something like that: a study can be designed to investigate the relevance of homozygosity for an HLA class II supertype in childhood leukaemias as the same genotype has been found to be increased in two adult leukaemias. Therefore, the specific a priori hypothesis is that homozygosity for this supertype is increased in childhood leukaemias. The comparison of its frequency between patients and controls would not require any statistical safeguard for multiple comparisons. If an increase is shown, then one would like to check the individual members of this supertypic family to see whether this increase is due to an increase in any of them. These comparisons would require the correction for the multiple comparison before any increase is declared to be significant. The third step would be to analyze all HLA-DR types since the data is available anyway. If any of the other HLA-DR types is found to be increased or decreased, caution should be taken in making a big event out of this. Such a finding should be subject to confirmation with a second study unless, of course, the P value can already stand multiplication by the number of all these comparisons made.

More recently, Klitz suggested that if a G-test is applied to the overall distribution of genotype frequencies between patients and controls and if this yields a significant result, the individual associations cannot be taken as the result of multiple comparisons ¹⁴. He reminds us that looking for an association using individual 2x2 tables for each antigen is an old method that stemmed from the presence of blank alleles at the time of serological typing. This would certainly lead to the multiple comparisons problem. Nowadays, blank alleles are no longer a problem and a G-test (or χ²-test) on a 2xN table, where N is the number of alleles/haplotypes or genotypes, is a sensible approach. A significant deviation in this analysis justifies further scrutiny of the data to single out the main association. Klitz also suggests grouping very rare alleles in one category to increase the sensitivity of the test.

Confusion may occasionally arise through wrong usage of the terms allele, gene, or marker in an association study. Some investigators state that they compare allele or haplotype frequencies, but only count each individual once. They, therefore, refer to what used to be phenotype frequencies in serological HLA studies, or in the case of genotyping studies, to marker frequencies (MF), which correspond to inferred phenotype frequencies if it is an expressed genotype. Allele (AF) or haplotype frequency (HF) is analogous to gene frequency (GF) in that they are always calculated in terms of the total number of chromosomes not individuals.

For a comprehensive discussion of the multiple comparisons issue, see Intuitive Statistics: Multiple Comparisons.

Case-control design and related issues

Population stratification can be thought of as confounding by ethnicity. If the ethnicity of cases and controls is reliably known, a stratified analysis would eliminate this problem. However, it is the unknown stratification within the population that causes this undesirable effect. To overcome this problem, software designed by Pritchard (Structure and Strat) can be used with genomic controls (Devlin, 1999; Pritchard, 1999).

In case-control studies, ideally there should be at least one control per case. If the number of cases is limited and cannot be increased easily, it may be worth increasing the number of controls to increase statistical confidence. The law of diminishing returns, however, dictates that a maximum of five controls per case is the limit (Epidemiology for the Uninitiated by Coggon, Rose and Barker. BMJ Publishing Group, 1997). A higher control to case ratio will not provide further benefit and may even result in type I errors.

The following quality control questions for clinical case-control studies also apply to genetic association studies:

1. Were the groups similar apart from the exposure under question?

2. Were the data on outcomes collected identically in both groups?

3. Is there a dose-response effect?

4. Does the relationship make biological and chronological (temporal) sense?

5. How strong is the association and how precise is the estimate?

Power calculations and expectation should be realistic in a genetic association study with a complex (multifactorial) disease. Genetic susceptibility to such diseases involves a large number of alleles, each conferring only a small genotypic risk (like OR = 1.2 to 2.0), that combine additively or multiplicatively to confer a range of susceptibilities in interaction with environmental factors. Apart from increasing the numbers of cases and controls to have greater power and validity, hypernormal controls or genetically severe cases (like cases selected for a family history of disease) can be used. Other strategies to increase statistical power in an association study are the use of haplotypes rather than alleles and even an ancestral haplotype approach in a cladistic haplotype analysis. Underpowered studies are one of the main reasons for failure to replicate an initial association study.

Significance testing

Constructing a 2x2 contingency table

When two groups are to be compared in a case-control study, it is necessary to have a 2x2 contingency table cross-tabulating the frequencies. This table is required for significance testing, relative risk (RR) or odds ratio (OR) estimation and CI calculation. A contingency table may have more than two rows and columns but the 2x2 table approach stems from classic case-control studies in epidemiology as the most elemental data structure leading to ideas of association ¹⁵. The current trend in genetic association analysis is to consider the genotypes as the genetic factor, or alleles if a multiplicative risk model is appropriate (Sasieni, 1997; Lewis, 2002; Cordell, 2005), see Genetic Models. For a multiallelic locus like the HLA loci, however, the number of genotype categories is large. This is why originally the comparisons based on the presence or absence of an allele were thought to be more parsimonious and were established as the standard approach for HLA association studies ⁵. Another reason for using individual 2x2 tables for each antigen in HLA association studies was the presence of blank alleles at the time of serological typing (which is no longer the case) ¹⁴. This contingency table is one of several ways of analyzing genetic association study data and does not take into account the underlying genetic model (recessive, dominant, codominant, additive, multiplicative) (Lewis, 2002; Sellers, 2004; Cordell, 2005). HLA association tests relying on the presence of an allele in cases vs controls implicitly use a dominant genetic model (carrying at least one copy of the allele is all that matters) but in compliance with the current trends, all genetic models should be explored if the genetic model for susceptibility is unknown (Lewis, 2002; Sellers, 2004; for online genetic model-based analysis: DeFinetti; MODEL).

A general layout of a contingency table for a conventional HLA-disease association study is as follows:

                        Allele i
Outcome           Present   Absent   Row totals
Patients             a         b        a+b
Controls             c         d        c+d
Column totals       a+c       b+d    N=a+b+c+d

The number in each cell is a count, i.e., a non-negative integer. Each subject must appear in one, and only in one, cell. In the table, switching rows and columns will not alter the result. When a contingency table is subjected to hypothesis testing, the null hypothesis is that there is no association between the two variables, the alternative being that there is an association of any kind (two-sided test). To do the hypothesis test, it is generally necessary to calculate the expected frequencies for each cell (see below). If the Chi-squared test is to be used, all expected frequencies should be at least 1 and at least 80% of them should be more than 5 (Cochran rule) ¹⁶. This means that for a 2x2 table, all expected frequencies should be more than 5. As will be discussed below, the best approach is to use an exact test which is not based on approximation or assumptions and does not require any kind of correction. Such tests used to be computationally highly demanding but this is no longer the case.
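
A minimal sketch of the expected-frequency calculation and the rule just described is given below. The function names are hypothetical, the cell labels a, b, c, d follow the layout of the table above, and the second set of counts is invented purely for illustration.

```python
def expected_counts(a, b, c, d):
    """Expected frequency for each cell of a 2x2 table:
    (row total x column total) / grand total."""
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    return [[r * col / n for col in col_totals] for r in row_totals]


def chi_squared_applicable(a, b, c, d):
    """Rule stated in the text for a 2x2 table: every expected
    frequency should be more than 5."""
    return all(e > 5 for row in expected_counts(a, b, c, d) for e in row)


# 2 of 100 patients versus 0 of 300 controls carry the genotype.
print(expected_counts(2, 98, 0, 300), chi_squared_applicable(2, 98, 0, 300))
# Hypothetical counts with larger cells.
print(expected_counts(30, 70, 60, 240), chi_squared_applicable(30, 70, 60, 240))
```

For the first table the expected count for carrier patients is only 0.5, so an exact test would be the safer choice there.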

Choice of a statistical test

The usual choice for testing the significance of an HLA-disease association is the Fisher’s exact test ⁹. This test is described below. Several other tests based on approximations can also be applied for the same analysis ^15;17. These are the ordinary Chi-squared test (with or without Yates's correction) ¹⁸, two-sample Z-test ^6;18, Woolf's method ¹⁹, and the Mantel-Haenszel χ² test ^13;20;21. When one or more of the expected numbers in the 2x2 table is less than five, and when the overall sample size is small, the Fisher’s test is commonly believed to be the only reliable one ^9;17;22, although there are also strong arguments against this ¹⁵.

Yates's correction for discontinuity

If the χ² test is used in the statistical analysis of the results, it may be necessary to use the Yates's continuity correction for small samples ^15;17;23. The Chi-squared distribution is continuous. That means that the curve of the χ² distribution model is continuous without any breaks. The values we calculate in χ² are, however, discrete values. This is because observed frequencies vary in discrete units (the number of occurrences of a gene -the entries in the 2x2 table- may be 4 or 5 but not 4.6). With degrees of freedom greater than 1 and with expected frequencies of at least 5 in each cell, this is not a problem as the difference between the statistic and the true sampling distribution is so small. For example, the difference between 100 and 101 is negligible (1%) compared to the difference between 4 and 5 (25%). Any difference between observed and expected frequencies will appear large when cell frequencies are small and may result in a type I error.

The Yates's correction helps the discrete values generated by the test statistic [Σ (O-E)²/E] approximate the continuous Chi-squared distribution more closely. This is achieved by changing the above formula to [Σ (|O-E| - 0.5)²/E]. This will result in a smaller calculated value of χ² and will reduce the risk of a type I error. In relatively large samples, this would not make an important difference. Most textbooks recommend that the Yates's continuity correction should be applied when the sample size is small or the contingency table contains any number less than 10 ^24;25. This correction, however, tends to overcompensate for discontinuity and may result in a more conservative decision than necessary ^15;26. As a simple rule, if a result is still significant after correction, or already non-significant without it, there is no problem. When a significant result becomes non-significant with the correction, this might be due to overcompensation. It is argued that the Yates’s correction produces severe conservative bias in the χ² test for association (χ² test for independence or homogeneity) where the observed marginal frequencies in the contingency table are subject to sampling variability ^15;26;27. Some statisticians strongly argue that the Yates’s correction is only necessary when the experimenter fixed both sets of marginal totals and when the cell frequencies are small ^15;26. For other types of contingency tables, Yates’s correction may indeed be too conservative but the current tendency is stronger towards its usage ¹.
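
The effect of the correction can be seen in a small sketch. The counts below are hypothetical and were chosen so that the uncorrected statistic exceeds 3.84 (the 5% critical value for 1 d.f.) while the corrected one does not, which is the borderline situation described above. The function name is also an assumption of this example, and flooring |O-E| - 0.5 at zero is a safeguard added here, not something stated in the text.

```python
def chi_squared_2x2(a, b, c, d, yates=False):
    """Sum of (|O - E| - k)^2 / E over the four cells, where k = 0.5
    with the Yates's correction and k = 0 without it."""
    n = a + b + c + d
    observed = [a, b, c, d]
    expected = [
        (a + b) * (a + c) / n,
        (a + b) * (b + d) / n,
        (c + d) * (a + c) / n,
        (c + d) * (b + d) / n,
    ]
    total = 0.0
    for o, e in zip(observed, expected):
        diff = abs(o - e)
        if yates:
            diff = max(0.0, diff - 0.5)  # added safeguard: never below zero
        total += diff ** 2 / e
    return total


# Hypothetical table: 12/20 patients and 5/20 controls carry the allele.
print(f"uncorrected: {chi_squared_2x2(12, 8, 5, 15):.2f}")
print(f"with Yates:  {chi_squared_2x2(12, 8, 5, 15, yates=True):.2f}")
```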

Significance level

In the study of biological variations, a P value of less than 0.05 is generally considered to be statistically significant. In HLA association studies, because of the nature of the HLA system and random variations in gene frequencies, it is not infrequent that a P value of this magnitude is obtained. When the significance level is chosen as P = 0.05, most reported associations cannot be confirmed in subsequent studies ⁸. It has been suggested that a P value of less than 0.01 should be used for the significance of an HLA association, and the description 'highly significant' should be reserved for P values less than 0.001 (Ref. 8).

Occasionally, a relative risk (RR) or odds ratio (OR) is quoted as a result of comparison between patients and controls without a P value or CI. A RR may be well above 1.0, especially in small groups, but without statistical significance it is meaningless. If CIs are calculated for such RRs (i.e., not associated with a P ≤ 0.05), they would be very wide, extending to both sides of 1.0. In one report, a RR similar to the one reported for an HLA association in Hodgkin's disease can be found even for comparisons between two control groups of the same study ²⁸. A RR/OR should always be reported together with its statistical interpretation and CIs.

The obtained P value (or significance probability) should be interpreted properly. Statistics does not prove the truth of anything; it just provides more or less evidence for its validity. The significance probability describes the extent to which the data are compatible with the null hypothesis: if the statistical experiment were to be repeated on many subsequent occasions and if the null hypothesis were true, the P value represents the proportion of further experiments that would give a result at least as extreme as the one observed ²⁹. More technically, it is the probability of obtaining a value of the test statistic as large as or larger than the one computed from the data when in reality there is no difference between the proportions. It is not the probability that the null hypothesis is true; a small P value only indicates that the observed data would be unusual if there were no true difference. If this probability is less than 5%, we usually reject the null hypothesis. Just as P ≤ 0.05 does not prove anything, P > 0.05 does not necessarily mean the null hypothesis is correct. These may be due to type I and type II errors, respectively. In any case, a significant result only adds weight to the alternative hypothesis. A negative result should be evaluated taking into account the power of the test. If a genotype occurs in two patients (n = 100), its total absence in 300 controls would yield a P value of 0.06 (Fisher’s exact test) with a RR of 15.3 (Woolf-Haldane analysis). This result cannot be easily dismissed as non-significant.
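
The figures in this example can be reproduced with a short sketch. It assumes that the Woolf-Haldane analysis refers to the usual Haldane modification of adding 0.5 to every cell before forming the cross-product ratio, and it takes the two-sided Fisher P value as the sum of the probabilities of all tables no more likely than the observed one (one of several conventions). The function names are hypothetical.

```python
from math import comb


def table_probability(a, b, c, d):
    """Hypergeometric probability of one 2x2 table with all margins fixed."""
    return comb(a + b, a) * comb(c + d, c) / comb(a + b + c + d, a + c)


def fisher_exact(a, b, c, d):
    """Return (one-sided P for a or more in the first cell, two-sided P)."""
    row1, row2, col1 = a + b, c + d, a + c
    p_observed = table_probability(a, b, c, d)
    upper_tail = two_sided = 0.0
    # Enumerate every table compatible with the observed margins.
    for x in range(max(0, col1 - row2), min(row1, col1) + 1):
        p = table_probability(x, row1 - x, col1 - x, row2 - (col1 - x))
        if x >= a:
            upper_tail += p
        if p <= p_observed * (1 + 1e-9):  # tolerance for rounding error
            two_sided += p
    return upper_tail, two_sided


def haldane_odds_ratio(a, b, c, d):
    """Cross-product ratio after adding 0.5 to each cell (assumed form)."""
    return ((a + 0.5) * (d + 0.5)) / ((b + 0.5) * (c + 0.5))


# Genotype present in 2 of 100 patients and in 0 of 300 controls.
one_sided, two_sided = fisher_exact(2, 98, 0, 300)
print(f"Fisher P (one-sided) = {one_sided:.3f}, (two-sided) = {two_sided:.3f}")
print(f"Haldane-corrected ratio = {haldane_odds_ratio(2, 98, 0, 300):.1f}")
```

For this particular table the one-sided and two-sided P values coincide, because the observed table is the only one as extreme as itself.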

One-sided or two-sided test?

Some comparisons, like those between two means or two proportions, can be evaluated by one-sided or two-sided P values (all comparisons of three or more groups are two-sided) ³⁰. When a significance level is chosen, say P = 0.05, we consider the difference between the two values (p₁ and p₂) significant if it is so unusual that, were the sampling from the population repeated 100 times under the null hypothesis, the difference would be smaller than the observed one in at least 95 of these. But we do not state the direction of the difference (both p₂-p₁ and p₁-p₂ are of interest). If we are evaluating a difference in either direction, a two-sided P value is required. If, however, the direction of the difference is stated in the hypothesis; for example, if the alternative hypothesis is that there is a difference and p₁ is greater than p₂, the two proportions can be evaluated by a one-sided P value. However, Bland & Altman ³⁰ state that "in general, a one-sided test is appropriate when a large difference in one direction would lead to the same action as no difference at all. Expectation of a difference in a particular direction is not adequate justification". Inappropriate use of the one-sided test would double the risk of a spurious significant difference. Two-sided tests should always be preferred unless there is a very good reason for doing otherwise.

Technically, the two-sided P value corresponds to the total area in both ends (tails) of the distribution curve for symmetrical distributions. If a one-sided P value is going to be used, it will correspond to the area at one tail of the curve and it will be half the value of the two-sided P value. A χ²-test with 1 d.f. is essentially a two-sided test, whereas Fisher's test usually gives a one-sided P value and there are several ways to calculate the two-sided P value.

Degrees of freedom

The number of degrees of freedom is a way of referring to the number of independent variables involved. This is usually one less than the total number of variables. In a 2x2 table containing the number of patients and controls for the presence or absence of an allele or genotype, the degrees of freedom is one. This is because the number of independent variables is one. If anybody has an allele, that person cannot also be negative for it. Similarly, a person is either a patient or a control. The general rule to calculate the degrees of freedom is to multiply the degrees of freedom for the rows and columns [i.e., d.f. = (r-1)(c-1)] ^1;25.

In the study of a bi-allelic locus, despite having three possible genotypes (say, AA, Aa, aa) the degrees of freedom is still one. These three genotypes are not independent of one another. Given the gene frequency of either A or a, the other one's frequency, and subsequently all genotype frequencies can be calculated. Thus, the number of independent variables is one. If a study is investigating the association of five different independent phenotypic markers in patients and controls (two groups), the degrees of freedom will be (5-1)(2-1) = 4. The 'degrees of freedom' is taken into account in the translation of the χ² value to a P value. This is because the χ² distribution is different for each number of degrees of freedom.

Significance tests

For the most common situation in the HLA field -the comparison of two independent proportions (p₁, p₂)- equivalent hypothesis tests, the Chi-squared test, the two-sample Z-test, the G-test, or the Fisher’s exact test can be used ^6;17 (for a link to Online Analysis of a 2x2 Table, see the end of this document).

Chi-squared test

To calculate the χ² for the same data, a 2x2 table is first constructed, actual numbers of occurrences are placed, and the expected frequencies in each cell are calculated. The expected frequency in a cell is the product of the relevant row and column totals divided by the sample size (grand total; N = a+b+c+d). For the cell with observed frequency 'a', for example, the expected value is (a+b)(a+c)/N. The difference between observed (O) and expected (E) values (residual) has the same absolute value in each cell but with different signs (- or +). This means that there is only one independent observation rather than four, so just one degree of freedom. The difference between O and E for each cell (O-E) is then calculated. The χ² is calculated as the sum of (O-E)²/E for all four cells:

C²= å [(O-E)²/E]

Intuitively, as the expected frequencies are calculated for each cell, our entries should be observed frequencies (there are other tests to compare an observed proportion with an expected one). The Chi-squared test relies on an normal approximation to the distribution of the cell counts, and this approximation may be poor for small sample sizes.

Alternatively, a different version of the C² formula can be used ^6;24. This version uses the observed frequencies in cells and avoids the need to calculate the expected values explicitly:

C²= [N (ad-bc)²] / [(a+b) (a+c) (b+d) (c+d)]

This formula for C² for a 2x2 table is mathematically identical to the general formula for C² but can only be justified for large samples ¹⁵. As for the Z-test, the P value for the C² test is obtained from tables.
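As a minimal sketch (the function names and the counts are illustrative assumptions, not taken from any study), the two ways of calculating the Chi-squared value for a 2x2 table can be compared in Python. The P value for one degree of freedom is computed with the complementary error function instead of being read from a table.

```python
import math

def chi2_from_expected(a, b, c, d):
    """Sum of (O-E)^2/E over the four cells of a 2x2 table."""
    n = a + b + c + d
    observed = ((a, b), (c, d))
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed[i][j] - expected) ** 2 / expected
    return chi2

def chi2_shortcut(a, b, c, d):
    """N(ad-bc)^2 / [(a+b)(a+c)(b+d)(c+d)]; no expected values needed."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (a + c) * (b + d) * (c + d))

def p_value_1df(chi2):
    """Upper-tail P value for 1 d.f. (replaces the table lookup)."""
    return math.erfc(math.sqrt(chi2 / 2.0))

# Hypothetical counts for illustration only.
a, b, c, d = 30, 70, 15, 85
print(chi2_from_expected(a, b, c, d), chi2_shortcut(a, b, c, d))
print(p_value_1df(chi2_shortcut(a, b, c, d)))
```

Both functions return the same value (apart from floating-point rounding), which is the sense in which the two formulas are mathematically identical.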

There is one well-justified argument against the use of the Chi-squared test. Williams points out that Pearson originally developed the goodness-of-fit test as a computationally convenient approximation of the maximum likelihood statistic, or G-test ³¹. It can be shown mathematically that the approximation is valid if the deviation of observed from expected frequencies is smaller than the expected frequency (|O-E| < E). Thus, if |O-E| > E in even one cell, the Chi-squared test should not be used. The Chi-squared approximation is usually poor for the test statistic G² when N / rc is smaller than five ²¹. Williams also reminds us that it is now computationally easy to calculate maximum likelihood statistics and there seems to be no reason for the continued use of Pearson's approximation.

A common mistake in the analysis of HLA sharing data is to use the ordinary χ² test when the χ² trend test is more appropriate ³². When more than two categories (for example, the number of antigens shared at two loci, 1 to 4) are compared between two groups (couples having recurrent spontaneous miscarriages and fertile couples), and there is an ordered increasing or decreasing difference along the N categories, the appropriate test to use is a trend test for 2xN tables ^17;33;34. In this situation, the ordinary χ² test completely ignores the order of the columns. Alternative tests suitable for similar situations are the Wilcoxon-Mann-Whitney test or the t-test with use of ordered scores ^33;35. In addition to the fetal loss studies, the trend test has also been used in an analysis of increasing heterozygosity in the HLA-A and -B loci in the elderly compared to children ³⁶. (See Analysis of Association, Confounding and Interaction for an explanation of the trend test.)

The Armitage trend test can also be used for case-control association data when the underlying genetic model is presumed to be additive {where risk increases by r-fold for heterozygotes and 2r-fold for homozygotes for the risk allele} (Lewis, 2002). This gradual increase is best assessed by the trend test after collapsing all other alleles into one category so that there are three genotypes to compare in cases and controls. For SNP data, there are usually three genotypes (AA, AB, BB) and the trend test is the best test if the genetic model is additive. Another advantage of the trend test is its robustness against deviations from HWE (Sasieni, 1997; Xu, 2002).
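A trend test for a 2xN table can be sketched in Python as follows. This is a sketch under stated assumptions: it uses one common form of the Cochran-Armitage statistic (with N rather than N-1 in the variance), equally spaced scores 0, 1, 2, ... by default, and hypothetical genotype counts; the function name trend_chi2 is illustrative.

```python
def trend_chi2(cases, controls, scores=None):
    """Chi-squared for linear trend (1 d.f.) across ordered categories.

    cases, controls: counts per ordered category (e.g. genotypes AA, AB, BB).
    scores: numeric score per category; defaults to 0, 1, 2, ...
    """
    k = len(cases)
    if scores is None:
        scores = list(range(k))
    totals = [r + s for r, s in zip(cases, controls)]
    n = sum(totals)            # grand total
    r_total = sum(cases)       # total number of cases
    sum_rx = sum(r * x for r, x in zip(cases, scores))
    sum_nx = sum(t * x for t, x in zip(totals, scores))
    sum_nx2 = sum(t * x * x for t, x in zip(totals, scores))
    numerator = n * (n * sum_rx - r_total * sum_nx) ** 2
    denominator = r_total * (n - r_total) * (n * sum_nx2 - sum_nx ** 2)
    return numerator / denominator

# Hypothetical genotype counts (AA, AB, BB) in cases and controls.
print(trend_chi2([20, 50, 30], [40, 45, 15]))
```

With only two categories the statistic reduces to the ordinary Chi-squared value for the 2x2 table; with three or more ordered categories it uses the ordering that the ordinary test ignores.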

Another misuse of the Chi-squared test is to use it when McNemar's test should be used. The Chi-squared test of independence of the two variables in a 2x2 table provides a test for the equality of two proportions when these proportions are estimated from two independent samples. If the two variables are derived from the same individuals, the Chi-squared test cannot be used as the variables are not from independent samples. The appropriate test for equality of proportions in paired samples is McNemar's test ^17;37. This test finds its use in situations where the same sample is used to assess the agreement (concordance) of two diagnostic tests or the difference (discordance) between two treatments. The distinction is similar to that between the two-sample t-test and the paired t-test for interval data. Like the trend test, McNemar's test gives a smaller P value than the ordinary Chi-squared test for the same 2x2 table. Thus, it is again to the benefit of the researcher to use it when it is the more appropriate test. The calculation is not much different from what is done to obtain the ordinary χ² value, but only the discordant values in the 2x2 table are included (with Yates's correction) ³⁷ (Online McNemar's Test).
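A minimal Python sketch of McNemar's test with the continuity correction is given below. It assumes that b and c denote the two discordant cells of the paired 2x2 table; the function names and the counts are hypothetical.

```python
import math

def mcnemar_chi2(b, c):
    """McNemar Chi-squared (1 d.f.) with continuity correction.

    b, c: counts in the two discordant cells of a paired 2x2 table
    (assumed labelling; the concordant cells do not enter the calculation).
    """
    if b + c == 0:
        raise ValueError("no discordant pairs")
    return (abs(b - c) - 1) ** 2 / (b + c)

def p_value_1df(chi2):
    """Upper-tail P value for a Chi-squared value with 1 d.f."""
    return math.erfc(math.sqrt(chi2 / 2.0))

# Hypothetical example: 15 pairs discordant one way, 5 the other way.
chi2 = mcnemar_chi2(15, 5)
print(chi2, p_value_1df(chi2))
```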

G-Test

The likelihood ratio (Chi-squared) test or maximum likelihood statistics are usually known as the G-test or G-statistics ³⁸. Whenever a Chi-squared test can be employed, it can be replaced by the G-test. In fact, the Chi-squared test is an approximation of the log-likelihood ratio, which is the basis of the G-test. Pearson originally worked out this approximation because the computation of the log-likelihood was inconvenient (but it no longer is). Pearson's statistic, χ² = Σ [(O-E)²/E], is mathematically an approximation to the log-likelihood ratio:

G = 2 Σ O ln (O/E)

The value called G approximates to the χ² distribution. The G value can also be expressed as

G = 2 [Σ O ln O - Σ O ln E] = 4.60517 [Σ O log₁₀ O - Σ O log₁₀ E]

The G-test as calculated above is applicable as a test for goodness of fit using the same number of degrees of freedom as for the Chi-squared test. It should be preferred when |O-E| > E for any cell.

For the analysis of a contingency table for independence, Wilks ³⁹ formulated the calculation of the G statistic as follows:

G = 2 [ Σ Σ f_ij ln f_ij - Σ R_i ln R_i - Σ C_j ln C_j + N ln N ]

where f_ij represents the entry in each cell, R_i represents each row total, C_j represents each column total, and N is the sample size. The same formula can be written using logarithm base 10 as follows:

G = 4.60517 [ Σ Σ f_ij log₁₀ f_ij - Σ R_i log₁₀ R_i - Σ C_j log₁₀ C_j + N log₁₀ N ]

The G value approximates to χ² with d.f. = (r-1)(c-1). When necessary, Yates' correction should still be used and the formula needs to be modified accordingly. With the exception of the above-mentioned condition that |O-E| should be smaller than E for the Chi-squared test to be valid, there is not much difference between the two tests and they should result in the same conclusion. When they give different results, the G-test may be more meaningful. The G-test has been gaining popularity in HLA and disease association studies ^14;40 (Online G-test).
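The two ways of writing G for a contingency table (from observed and expected values, and from cell entries and marginal totals) can be checked against each other with a short Python sketch. The function names and the counts are illustrative assumptions; empty cells are treated as contributing zero, following the usual convention that 0 ln 0 = 0.

```python
import math

def g_from_expected(table):
    """G = 2 * sum of O * ln(O/E), with E from the row and column totals."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    g = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            if observed > 0:  # an empty cell contributes nothing
                expected = row_totals[i] * col_totals[j] / n
                g += observed * math.log(observed / expected)
    return 2.0 * g

def g_from_totals(table):
    """G from cell entries and marginal totals, without expected values."""
    def xlnx(x):
        return x * math.log(x) if x > 0 else 0.0
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    cells = sum(xlnx(f) for row in table for f in row)
    rows = sum(xlnx(r) for r in row_totals)
    cols = sum(xlnx(c) for c in col_totals)
    return 2.0 * (cells - rows - cols + xlnx(n))

# Hypothetical 2x2 counts for illustration only.
table = [[30, 70], [15, 85]]
print(g_from_expected(table), g_from_totals(table))
```

Both functions accept tables larger than 2x2; the result is referred to the Chi-squared distribution with (r-1)(c-1) degrees of freedom.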

Fisher’s exact test

When the Chi-squared test is used for small samples, Yates's correction may need to be used to make adjustments for continuity. This correction, however, does not remove the requirement for the expected frequencies. As mentioned above, the expected frequencies in each cell (of a 2x2 table) should be at least five for the χ² test to be reliable (note that the observed frequencies may be less than five). The general belief is that when there is one or more expected frequency of less than five, the alternative approach for significance testing is Fisher's exact test ^1;6;9;38. There is no equivalent method of estimation for comparing proportions from very small samples and Fisher's test is the best choice for such data ²². The result of Fisher's test is usually more or less the same as the result of the Chi-squared test with Yates's correction. Fisher's exact test does not depend on an approximation to the probability distribution of the cell counts. There are, however, statisticians defending the idea that this is a misconception in statistics and that Fisher's exact test is not suitable for small sample sizes ¹⁵. Yates himself replied to such criticism and defended the suitability of both Fisher's test and Yates's correction where appropriate ⁴¹. He also quotes Fisher as stating that the exact test should always be used whether or not the margins are determined in advance.

Fisher's exact test also uses the frequencies in a 2x2 table but the calculations are different from those of the other significance tests. It is based on the observed row and column totals. The method consists of constructing all possible 2x2 tables giving the same row and column totals as the observed data. For each table, the probability of such data arising if the null hypothesis is true is calculated. To get the P value, the probability of the observed table and the probabilities of all alternative 2x2 tables that are as extreme as, or more extreme than, the observed one are added up ⁶. The calculations involve factorials, which are very cumbersome for big numbers when done by hand. Fisher's exact test can now be applied by means of computers (which can cope with the calculations of factorials for large numbers) even to large samples. This test is essentially one-sided. To get the two-sided P value, if required, the P value may be doubled ¹ (to perform Fisher's Exact Test online, see the links at the end). There are exact tests available other than Fisher's; one of them, RxC, was designed by Mark Miller and can perform the exact test for large contingency tables.
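The enumeration described above can be sketched in Python using exact integer arithmetic (math.comb, Python 3.8 or later). This is a minimal sketch: the function names are hypothetical, the one-sided P value sums the tables in the observed direction only, and the two-sided value simply follows the doubling rule mentioned in the text (other two-sided conventions exist and may give different results).

```python
from math import comb

def fisher_one_sided(a, b, c, d):
    """One-sided exact P value for a 2x2 table with fixed margins.

    Sums the hypergeometric probabilities of the observed table and of all
    tables that are more extreme in the observed direction.
    """
    row1, col1, n = a + b, a + c, a + b + c + d
    lowest = max(0, row1 + col1 - n)   # smallest possible value of cell a
    highest = min(row1, col1)          # largest possible value of cell a
    if a * d >= b * c:
        candidates = range(a, highest + 1)   # cell a as large or larger
    else:
        candidates = range(lowest, a + 1)    # cell a as small or smaller
    favourable = sum(comb(row1, x) * comb(n - row1, col1 - x) for x in candidates)
    return favourable / comb(n, col1)

def fisher_two_sided_by_doubling(a, b, c, d):
    """Two-sided P value by doubling the one-sided value (capped at 1)."""
    return min(1.0, 2.0 * fisher_one_sided(a, b, c, d))

# Small hypothetical table.
print(fisher_one_sided(3, 1, 1, 3), fisher_two_sided_by_doubling(3, 1, 1, 3))
```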

It is still common to mix the results of Chi-squared and Fisher's exact test in the same study. If some comparisons require the use of Fisher's exact test because of small expected numbers then it is best to use the Fisher's test for all comparisons. It does not make much sense to report some results by one test and others by another when all can be done most reliably by an exact test.

Two-sample Z-test

For the Z-test, only the two proportions (p₁, p₂) together with the sample sizes (n₁, n₂), and the estimated proportion (p_e) in the population are required ^6;18. The desired condition for the validity of this test is a large enough size (≥ 30) of both samples. The basic formula for this test is:

Z = (p₁-p₂) / SE_diff

The calculation of the SE of the difference (SE_diff) is slightly different for this test. First, the expected (combined) proportion in the population is calculated:

p_e = (n₁p₁ + n₂p₂) / (n₁+n₂) (that is, the total number of positives in both samples divided by the total sample size)

This is then used to calculate SE_diff:

SE_diff= [p_e(1-p_e) (1/n₁+1/n₂)]^1/2
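A minimal Python sketch of the two-sample Z-test follows. It assumes the test is given the counts of positives (x1, x2) and the sample sizes, so that the pooled proportion is the total number of positives divided by the total sample size; the function name and the numbers are hypothetical.

```python
import math

def two_sample_z(x1, n1, x2, n2):
    """Z-test for two independent proportions; returns (Z, two-sided P)."""
    p1, p2 = x1 / n1, x2 / n2
    p_e = (x1 + x2) / (n1 + n2)   # pooled (expected) proportion
    se_diff = math.sqrt(p_e * (1 - p_e) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se_diff
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail area
    return z, p_value

# Hypothetical data: 30/100 positive in patients, 15/100 in controls.
z, p = two_sample_z(30, 100, 15, 100)
print(z, p, z * z)
```

For a 2x2 table, the square of Z equals the uncorrected Chi-squared value for the same data, which is why the two tests are described as equivalent.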

One-sample Z-test

In HLA studies, it may be necessary to compare an observed value with an expected value. The expected value may be a homozygosity rate calculated from the gene or haplotype frequency in the same sample of the population assuming Hardy-Weinberg equilibrium.

The statistical test for the null hypothesis 'population proportion is equal to a specified proportion' is the Z-test for single proportion ^6;18. The Z value for this comparison is calculated as follows:

Z = (p_o-p_e) / [p_o(1-p_o)/N]^1/2

where p_o is the observed and p_e is the expected proportion. The corresponding P value for the Z statistics can be found from the standard normal distribution tables. For a two-sided test, if Z is greater than 1.96 (or smaller than - 1.96), P < 0.05.
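The one-sample Z-test can be sketched in Python as below. The default follows the formula given above, with the standard error calculated from the observed proportion; many textbooks instead calculate it from the expected proportion, which is offered here as an option. The function name, the null_se argument and the numbers are illustrative assumptions.

```python
import math

def one_sample_z(p_o, p_e, n, null_se=False):
    """Z-test for a single proportion; returns (Z, two-sided P).

    p_o: observed proportion, p_e: expected proportion, n: sample size.
    null_se=False uses the observed proportion in the SE (as in the text);
    null_se=True uses the expected proportion instead.
    """
    p_for_se = p_e if null_se else p_o
    se = math.sqrt(p_for_se * (1 - p_for_se) / n)
    z = (p_o - p_e) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical example: observed homozygosity 12%, expected 8%, N = 400.
print(one_sample_z(0.12, 0.08, 400))
print(one_sample_z(0.12, 0.08, 400, null_se=True))
```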

Confidence interval estimation

When a value (a proportion or OR obtained from a sample) is an estimate of an unknown "true" value (in the population from which the sample has been drawn), a CI can be calculated for it. A CI is more informative than the simple result of a hypothesis test, where a null hypothesis is rejected or accepted. One of the traps of hypothesis testing is that non-significance may be equated with accepting the null hypothesis whereas it may just be a failure to reject it (due to a small sample size, for example). In the case of comparing two groups, a CI enables the researcher to see how large the difference between two proportions may be, not simply whether it is different from zero. It also shows whether a non-significant result suggests an outright rejection of the alternative hypothesis (lack of difference, i.e., the CI includes zero and does not reach the critical value for the difference) or an inconclusive result arising from lack of evidence (the CI includes zero but also exceeds the critical value) ⁴².

CIs provide a range of plausible values for the unknown 'constant' parameter (such as the mean or expected frequency in the population from which the sample has been drawn). CIs can be calculated for different confidence levels. If a CI is calculated at the 95% level (as is usually done), 5% of the time the true population parameter will not be contained within the interval calculated from the sample statistics. More technically, it means that 95% of the intervals calculated in this way from repeated samples drawn from the population will contain the population parameter. In a way, this is the acknowledgement of the fact that a different sample from the same population may produce a different result.

The width of the CI also gives us an idea about how uncertain we are about the unknown population parameter (the mean, for example). The most common CIs are calculated for a mean (or a single proportion), for the difference between two means and for a RR or OR. A very wide interval may indicate that the sample size should be increased to be more confident about the parameter. If the hypothesis test yields a significant result at the 5% level, the 95% CI for the difference between two means will not include 0, and the 95% CI for RR or OR will not include 1.0 (unity).

Calculation of confidence interval

In general, for any statistic that has a normal sampling distribution (such as the difference between means or proportions), a CI is constructed by adding to, or subtracting from, an estimate a multiple of its standard error ¹⁸. For example, a 95% CI is given by

statistic ± 1.96 SE

where SE refers to the standard error of the statistic, and 1.96 is the critical Z value (Z_c) for P = 0.05 (or 2.58 for a 99% CI). The value of Z_c does not depend on the sample size.

CI for a single proportion can be calculated using this principle:

CI = proportion ± Z_c SE

The SE for a single proportion (p) from a sample with size n is:

SE = [p(1-p)/n]^1/2

Thus, 95% CI of a single proportion becomes ⁴³:

p ± 1.96 [p(1-p)/n]^1/2

As seen in this formula, sample size has an inverse relationship with the width of the CI: larger samples yield a narrower CI. It can be shown mathematically that [p(1-p)]^1/2 can simply be replaced by 0.5. Whatever the value of p may be, p(1-p) will never be greater than 0.25, so its square root will never be greater than 0.5. If we use 0.5/√n, we will be using a number equal to or larger than the real SE in the CI formula. If anything, this could only result in a wider CI.

A 95% CI of the difference between two proportions is calculated as ⁴³:

P_diff ± 1.96 SE_diff

SE_diff = (1/N) [(b+c) - ((b-c)²/N)]^1/2

(b and c are the discordant cells of the 2x2 table, as in paired data; N is the grand total in the 2x2 table)

An alternative formula for 95% CI of the difference between two proportions (p₁ from n₁ samples, and p₂ from n₂ samples) ^6;18:

(p₁ - p₂) ± 1.96 [p₁(1-p₁)/n₁+p₂(1-p₂)/n₂]^1/2

When a difference in the frequency of a gene or genotype is reported, the 95% CI for the difference should accompany this finding together with a P value rather than CIs of each mean separately ^4;43. It is now possible to analyze a 2x2 table and obtain OR together with its CIs online (see the end of the document for the link).
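The interval formulas for a single proportion and for the difference between two independent proportions can be sketched in Python as follows. The function names and the example figures are hypothetical; Z_c defaults to 1.96 for a 95% CI.

```python
import math

def proportion_ci(p, n, z_c=1.96):
    """CI for a single proportion p from a sample of size n."""
    se = math.sqrt(p * (1 - p) / n)
    return p - z_c * se, p + z_c * se

def conservative_proportion_ci(p, n, z_c=1.96):
    """As above, but with 0.5/sqrt(n) as an upper bound for the SE."""
    se_max = 0.5 / math.sqrt(n)
    return p - z_c * se_max, p + z_c * se_max

def difference_ci(p1, n1, p2, n2, z_c=1.96):
    """CI for the difference between two independent proportions."""
    se_diff = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    diff = p1 - p2
    return diff - z_c * se_diff, diff + z_c * se_diff

# Hypothetical example: 30% of 100 patients versus 15% of 100 controls.
print(proportion_ci(0.30, 100))
print(conservative_proportion_ci(0.30, 100))
print(difference_ci(0.30, 100, 0.15, 100))
```

The conservative interval is never narrower than the ordinary one, and an interval for the difference that excludes zero corresponds to a significant difference at the matching level.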

Relative risk / Odds ratio calculation

The calculation of RR conferred by an HLA antigen / haplotype / genotype is usually done by Woolf's method ¹⁹ which was later modified by Haldane ⁴⁴. Woolf defined the relative incidence (RI) as follows:

RI = ad / bc (a, b, c, d are the entries in the 2x2 table presented above)

Conventionally, RR is used in HLA and disease studies instead of relative incidence ⁵. To confuse the terminology further, the cross-product ratio described above as RI actually gives what is called the odds ratio (OR) in epidemiology. This distinction is not always made properly in publications on HLA associations. In the study of HLA specificities, some of the cell frequencies in the 2x2 table may be very small or even zero. If all the patients are positive for the HLA specificity or if none of the controls has it, the denominator would be zero. In this case, the RR would be undefined. For such situations Haldane modified Woolf's formula for RR:

RR = [(2a+1)(2d+1)] / [(2b+1)(2c+1)]

More precisely, the RR in a prospective epidemiological study is defined differently (this is the real RR as it is used in epidemiologic studies) ^6;13;29. In the 2x2 table, the proportion of persons with the risk factor (allele i in this case) having the disease is a/(a+c), and the corresponding proportion for those lacking the risk factor is b/(b+d). Relative risk is obtained by dividing the risk of development of disease among subjects with the risk factor by the risk of development of disease among subjects without the risk factor:

RR = [a/(a+c)] / [b/(b+d)], which equals a(b+d) / [b(a+c)]

Especially for small values of a and b compared to c and d, ad/bc (which is the OR) is a close approximation ^1;6;13. If the probability of an event is p, then the odds of that event is p/(1-p). The OR is the ratio of the odds of the risk factor in a diseased group and in a non-diseased (control) group (RR is the ratio of proportions in two groups). OR is more appropriate for retrospective case-control studies ^1;6. In retrospective studies, the selection of the subjects is based on the outcome (the rows) such as patients and healthy controls. In prospective studies, however, it is based on the characteristic defining the groups (the columns, presence or absence of an HLA type). Thus, in a prospective study, it is known how many individuals in the risk group developed the disease. In a retrospective study, RR would not be a precise estimate as the risk of the outcome cannot be evaluated (because of the way the subjects were sampled). In the above RR formula, if a and b are small, as they usually are (hence the choice of retrospective design because of the rarity of the disease), RR is approximated by OR (ad/bc). In other words, both values of (1-p) for exposed and non-exposed groups will be close to 1 and can be ignored. This is why in both retrospective and prospective HLA and disease association studies, ad/bc is calculated and quoted as either OR or RR. In cohort studies, however, and especially when the outcome is frequent enough (>10%), the OR does not approximate to RR ⁴⁵. Under the null hypothesis, the expected value of RR or OR is 1.0 (unity).
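The three quantities discussed in this section can be compared in a short Python sketch. It assumes the cell labelling used in the formulas above (a and b are the diseased subjects with and without the risk factor; c and d are the corresponding non-diseased subjects); the function names and the counts are hypothetical.

```python
def odds_ratio(a, b, c, d):
    """Cross-product ratio ad/bc (Woolf's relative incidence)."""
    return (a * d) / (b * c)

def haldane_odds_ratio(a, b, c, d):
    """Haldane's modification, defined even when a cell is zero."""
    return ((2 * a + 1) * (2 * d + 1)) / ((2 * b + 1) * (2 * c + 1))

def relative_risk(a, b, c, d):
    """Risk in those with the factor divided by risk in those without it."""
    return (a / (a + c)) / (b / (b + d))

# Rare outcome (hypothetical cohort): OR and RR nearly agree.
print(odds_ratio(20, 10, 980, 990), relative_risk(20, 10, 980, 990))

# Frequent outcome: OR overstates RR.
print(odds_ratio(400, 200, 600, 800), relative_risk(400, 200, 600, 800))

# Zero cell: ad/bc is undefined, Haldane's formula still gives a value.
print(haldane_odds_ratio(5, 0, 20, 25))
```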

Several tests have been described to compute homogeneity of odds ratios for stratified data (combining estimates of the odds ratio) ^17;20;21;46-48. The most popular of these is the Mantel-Haenszel test for the common odds ratio for stratified data ^17;20;21. A Chi-squared test for homogeneity of ORs over different strata can be performed using Woolf's method ¹⁷.

Confidence interval for relative risk or odds ratio

As mentioned above, for variables with a normal sampling distribution, we need the SE of the statistics to calculate the CI. RR or OR itself does not have a normal sampling distribution, therefore, to achieve approximate normality, it has to be transformed. Since the natural logarithm of the RR or OR (lnRR or lnOR) has normal sampling distribution, the same principle can be applied to it. It can be shown mathematically that the standard error of the transformed RR is equal to ¹⁸:

SE (lnRR) = (lnRR) / √χ² = (lnRR) / χ

Substituting this into the general equation gives the 95% CI for lnRR as:

lnRR ± 1.96 (lnRR) / χ

After exponentiation, this simplifies to:

RR^(1 ± 1.96/χ) and similarly OR^(1 ± 1.96/χ)

A general formula for the calculation of the CI of RR / OR at any confidence level is:

RR^(1 ± Z_c/χ) or OR^(1 ± Z_c/χ)

where χ is the square root of the χ² value obtained for the same 2x2 table. Confidence limits obtained with this approach are reasonably good approximations to the exact limits, especially when RR / OR is not too far from unity. An alternative approach to calculating the CI for OR is as follows ¹⁷:

exp{ ln(OR) ± Z_c [1/a + 1/b + 1/c + 1/d]^1/2 }

Estimation of haplotype frequencies

A haplotype contains a minimum of two loci on the same chromosome. If these two loci are not in linkage disequilibrium (LD), it can be said that the frequencies of their alleles are independent of each other. In this case, the expected frequency of a two-locus haplotype can be calculated as the probability of the occurrence of two independent (or joint) events simply by multiplying their gene frequencies. The same line can be followed for three- or multiple-locus haplotypes.

In the case of LD, however, the gene frequencies are not independent and the presence of one allele may influence the presence of a particular allele at the other locus or loci. It is safer to assume LD and calculate the LD parameter D in each case. For the expected frequency, D is added to the product of gene frequencies. If there is no LD, D will be zero (or not significantly different from zero), if there is positive LD it will be a positive value. It can also be negative if the two alleles tend not to occur together.

Ideally, the haplotype frequencies should be calculated from family typing data. Obviously, this gives the most accurate results. In practice, however, when family data are not available, D and haplotype frequencies are calculated from a sample of the population data by constructing a 2x2 contingency table if no software is accessible ⁴⁹. A contingency table for this purpose contains the individual (observed) values cross-classified by levels in two different attributes. A common 2x2 table constructed in HLA studies is as follows:

                                    allele i
                         Present (+)   Absent (-)   Row totals
allele j  Present (+)    a (+/+)       b (+/-)      a+b
          Absent (-)     c (-/+)       d (-/-)      c+d
          Column totals  a+c           b+d          N=a+b+c+d

Counts for each combination of levels (presence or absence) of the two factors (alleles) are placed in each cell. The corresponding D_ij is estimated by the formula:

D_ij = (d/N)^1/2 - [((b+d)/N) ((c+d)/N)]^1/2 (originally described by Bodmer & Bodmer, 1970; see also Schipper, 1998)

The haplotype frequency (HF_ij) equals GF_i × GF_j + D_ij, where GF is the gene frequency (the proportion of the chromosomes carrying a particular allele). The haplotype frequency calculated with this formula from the population data compares reasonably well with the estimates obtained directly from counting haplotypes constructed from family segregation data ⁴⁹. A recent study concluded that this method generates a reliable estimate of a haplotype frequency with the exception of very small haplotype frequencies (Pearson's r = 0.9949) (Ref. 50).
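The calculation of D_ij and the haplotype frequency from a 2x2 table can be sketched in Python. One assumption goes beyond the text: the gene frequencies are estimated from the same table by the square-root method (GF = 1 minus the square root of the frequency of subjects lacking the allele), which presumes Hardy-Weinberg proportions. The function name and the counts are hypothetical; the counts were constructed so that the true values are known.

```python
import math

def ld_from_2x2(a, b, c, d):
    """Return (GF_i, GF_j, D_ij, HF_ij) from phenotype counts.

    a: both alleles present; b: allele i absent, allele j present;
    c: allele i present, allele j absent; d: both absent.
    """
    n = a + b + c + d
    # Assumed square-root estimates of the gene frequencies
    # (Hardy-Weinberg proportions are presumed).
    gf_i = 1 - math.sqrt((b + d) / n)
    gf_j = 1 - math.sqrt((c + d) / n)
    d_ij = math.sqrt(d / n) - math.sqrt(((b + d) / n) * ((c + d) / n))
    hf_ij = gf_i * gf_j + d_ij
    return gf_i, gf_j, d_ij, hf_ij

# Counts built from GF_i = 0.2, GF_j = 0.1 and HF_ij = 0.05 in 10000 subjects.
print(ld_from_2x2(1125, 775, 2475, 5625))
```

With these constructed counts the sketch recovers D_ij = 0.03 and HF_ij = 0.05 (to floating-point precision), which is a useful sanity check before applying it to real data.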

Statistical Methods for LD Estimation

While the Mattiuz formula can be used to calculate two-locus haplotype frequencies manually, an alternative method, maximum likelihood estimation, can only be used if computing facilities are available. This method was originally described by Yasuda and Tsuji ⁵¹ and compared with other methods by Schipper et al ⁵⁰. It can also be used for multiple-locus haplotypes ⁵². User-friendly software that estimates a variety of LD measures can be downloaded from the internet. One of the most sophisticated population genetic data analysis packages, ARLEQUIN, uses the EM algorithm to calculate multilocus LD. Some of the other software to perform LD analysis are: Genetic Data Analysis, EH, MLD, LDA, MIDAS, Haploview and PopGene. LD analysis can also be performed online (Online LD Analysis, Genotype2LDBlock).

Interpretation of LD Data

The patterns of LD observed in natural populations are the result of a complex interplay between genetic factors and the population's demographic history. LD is usually a function of distance between the two loci. This is mainly because recombination acts to break down LD in successive generations. However, physical distance could account for less than 50% of the observed variation in LD. Other factors that influence LD include changes in population demographics (such as population growth, bottlenecks, geographical subdivision, admixture and migration), selective forces and local variation in recombination rates. An extraordinary example of the effect of recombination rates on LD is the discrepancy between genetic distance and physical distance between HFE and HLA-A, which generates strong LD despite 5Mb distance. Regional LD may also be variable according to haplotype. An example has been presented for HLA haplotypes (Ahmad, 2003). Haplotype-specific patterns of LD may reflect haplotype-specific recombination hotspots as has been shown for mouse MHC (Yoshino, 1994). For more on LD, see Basic Population Genetics.

Other statistical tests

Loglinear modeling: A loglinear model can be used in the analysis of contingency tables, especially those larger than 2x2 ^38;53. It allows more than two discrete variables to be analyzed and interactions (associations) between any combinations of them to be identified. An association in a contingency table appears as an interaction in a loglinear model. In a multidimensional table, each interaction (second order or higher order) can be analyzed without constructing separate smaller tables. In a loglinear model, the counts in cells are the response (dependent) variable and are modeled using the explanatory (independent) variables (categorical variables defining the rows and columns of a contingency table). The interpretation is made by means of the differences in the regression deviances of alternative models. The deviance has a Chi-squared distribution.

Practical uses include comparison of two 2x2 tables. The most frequent use of loglinear models in HLA research is the estimation of linkage disequilibrium values ^54;55. To make inferences about linkage disequilibrium with a loglinear approach, a series of multiplicative models is constructed that includes different numbers of disequilibrium terms. The difference in regression deviances for two models that differ only by one term can be used to assess the significance of linkage disequilibrium represented by that term. The need for loglinear model is obvious when two LD values are to be compared. This could be a comparison of LD between the two sexes, or between patients and controls. In this case, a model is set up and the three-factor (two alleles and sex/disease status) interaction is assessed. A similar use may be assessment of the significance of the difference in the association in two subgroups or two sexes. For online loglinear analysis of two 2x2 tables, see the link at the end.

Logistic regression: Simple logistic regression is used when the outcome variable is binomial (such as dead/alive, disease absent or present, metastasis absent or present). It is ideally suited for the analysis of case-control studies where the outcome is either being a case or control ^47;56;57. The explanatory variables may be binomial, categorical or continuous (such as smoker or not, WBC count, height, age at first menstruation, etc). When the model selection is done and computation is complete, the outcome is a logistic regression equation in the following format:

logarithm of odds = a + b₁x₁ + b₂x₂ + ... + b_nx_n

Here, the logarithm of odds is the natural logarithm of the odds of the outcome given the values of all variables included in the model. By using different values of the explanatory variables in the formula, the odds (and hence odds ratios) for any combination of the values the variables can take (for example, being male or female, smoking status, previous history of vaccination, age, etc) can be calculated. Each coefficient (b) provides a measure of the degree of association between that variable and the outcome. This coefficient is the logarithm of the odds ratio for that variable (OR = e^b) controlled (adjusted or corrected) for the other variables in the model. It is also possible to calculate confidence intervals for the estimated odds ratio as well as the statistical significance of each coefficient ¹⁷. This property of logistic regression, which enables the calculation of controlled odds ratios, makes it unique in the analysis of multivariable data (Katz, 2003). The only concern is that the coefficients change depending on which variables are included in the model, and model selection is a relatively subjective procedure. Model selection should not be left entirely to the computer, which would use statistical stepwise methods to choose the best selection; the hypothesis and biological plausibility should be considered too. Logistic regression can be used for analysis of association with codominant loci where the disease risk associated with heterozygotes lies between that of the two homozygotes but not in the specific relationship of a multiplicative or additive model {where risk increases by r-fold for heterozygotes and r²-fold for homozygotes for the risk allele in a multiplicative model; or r-fold for heterozygotes and 2r-fold for homozygotes in an additive model} (Lewis, 2002).

It is also possible to examine interactions between any combinations of variables using logistic regression. The interaction term will also have a coefficient but this does not correspond to an adjusted odds ratio. Some examples of the use of logistic regression in HLA research are, HLA sharing studies (outcome: fetal loss vs live birth) ⁵⁸, HLA effects on disease severity in rheumatoid arthritis ⁵⁹, and the effect of HLA compatibility on graft failure following bone marrow transplantation ⁶⁰.

Resampling Statistics: Resampling procedures (bootstrap, permutation and other simulations) have recently become more popular as the method of choice for hypothesis testing and estimation of confidence intervals. With resampling, the data are repeatedly resampled, if needed according to a model suggested by the data, to assess the variability of a statistic or estimate calculated from the observed data. The software Haploview offers an easy way to run permutation tests. See Resampling: The New Statistics book by JL Simon, 1997.
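As a minimal sketch of the resampling idea (not the procedure implemented in Haploview), a permutation test for the difference between two proportions can be written in Python. Group labels are shuffled repeatedly and the proportion of shuffles giving a difference at least as large as the observed one is taken as the P value. The function name, the number of permutations, the seed and the data are all illustrative assumptions.

```python
import random

def permutation_p_value(cases, controls, n_perm=2000, seed=0):
    """Two-sided permutation P value for a difference in carrier frequency.

    cases, controls: lists of 0/1 indicators (marker absent/present).
    """
    rng = random.Random(seed)       # fixed seed so the sketch is repeatable
    n1 = len(cases)
    observed = abs(sum(cases) / n1 - sum(controls) / len(controls))
    pooled = list(cases) + list(controls)
    n2 = len(pooled) - n1
    at_least_as_extreme = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)         # reassign group labels at random
        diff = abs(sum(pooled[:n1]) / n1 - sum(pooled[n1:]) / n2)
        if diff >= observed - 1e-12:
            at_least_as_extreme += 1
    # Add one to numerator and denominator so that P is never exactly zero.
    return (at_least_as_extreme + 1) / (n_perm + 1)

# Hypothetical data: marker present in 45/100 cases and 15/100 controls.
cases = [1] * 45 + [0] * 55
controls = [1] * 15 + [0] * 85
print(permutation_p_value(cases, controls))
```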

Association and Causality?

However strong, an association does not necessarily mean causation. Several criteria have been proposed to assess the role of an associated marker in causation. Some of those are as follows:

1. Biological plausibility

2. Strength of association (this is not measured by the P value)

3. Dose response (are the heterozygotes intermediate between the two homozygotes, or is homozygosity showing a stronger association than just having the marker?)

4. Time sequence (this is inherent in the germ-line nature of HLA genes)

5. Consistency (see below for reasons for inconsistency in HLA association studies)

6. Specificity of the association to the disease studied

Some authors also include consideration of alternative explanations, coherence of the findings with current knowledge and experimental support in what is cumulatively called Hill's Criteria of Causality; see also Assessing Disease Associations and Interactions by Julian Little in Human Genome Epidemiology (a CDC book).

Why Are There Inconsistencies?

Large meta-analysis studies and systematic reviews have shown that at best no more than half of original associations can be consistently replicated. This lack of consistency is a widely recognized limitation of association studies, and is often ascribed to inadequate statistical power, population substructure, and population-specific linkage disequilibrium.

1. Mistakes in genotyping (lack of HWE in controls is usually an indication of problems with typing rather than selection, admixture, nonrandom mating or other reasons for violation of HWE)

2. Poor control selection (would your controls be in the case group if they had the disease, and would the cases be in your control group if they were free of the disease; are they from the same study base as cases; are they in any way matched to cases at least in age and sex?). Did you make any effort to establish a comparable control group or did you settle for a convenience group?

3. Design problems including the statistical power issue (negative results due to lack of statistical power should be distinguished from truly negative results)

4. Publication bias (are there many more studies with negative results but we have never heard about them?)

5. Disease misclassification or misclassification bias

6. Excessive type I errors (are the positive results due to using P < 0.05 as the statistical significance?)

7. Post hoc and subgroup analysis (are positive results due to fishing?)

8. Unjustified multiple comparisons

9. Failure to consider the mode of inheritance in a genetic disease

10. Failure to account for the LD structure of the gene (only haplotype-tagging markers will show the association, other markers within the same gene may fail to show an association)

11. Likelihood that the gene studied accounts for only a small proportion of the variability in risk

12. True variability among different populations in allele frequencies.

Most of the possible flaws listed above can be avoided by strict adherence to the standards of modern epidemiological studies (Potter, 2001 & 2003; Tabor, 2002; Little, 2002; Ransohoff, 2004; Hattersley, 2005).
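To make the points on excessive type I errors and multiple comparisons in the list above concrete, the sketch below adjusts a set of P values in two ways: the Bonferroni method and the Benjamini-Hochberg false discovery rate procedure. The second is shown only as one commonly used, less conservative alternative; its use here, the function names and the P values are assumptions for illustration, not a recommendation made by this page.

```python
def bonferroni(pvals):
    """Multiply each P value by the number of tests (capped at 1)."""
    m = len(pvals)
    return [min(1.0, p * m) for p in pvals]

def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted P values, returned in input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value to the smallest so that the
    # adjusted values never decrease as the raw P values increase.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical P values from five allele-level comparisons at one locus
raw = [0.001, 0.008, 0.039, 0.041, 0.6]
for p, pb, pf in zip(raw, bonferroni(raw), benjamini_hochberg(raw)):
    print(f"raw = {p:.3f}  Bonferroni = {pb:.3f}  BH = {pf:.3f}")
```

Whichever correction is chosen, the number of comparisons actually made (not only those reported) should be used as the number of tests.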

Point-by-point lists of evaluation criteria for genetic case-control association studies have been published (Silverman, 2000; Weiss, 2001; Healy, 2006) and are accessible online. Editorials in Nature Genetics (1999; PDF) and Nature Medicine (2005; PDF), and other reviews (Cardon, 2001; Schork, 2001; Little, 2002; Campbell & Rudan, 2002; Cooper, 2002; Romero, 2002; Lewis, 2002; Colhoun, 2003; Little, 2004; Freimer, 2005; Hattersley & McCarthy, 2005; Healy, 2006), have highlighted the features of a good association study and provide further guidelines for study design, execution, analysis and interpretation (see Checklist for Statistical Adequacy). Readers are strongly advised to consult these before beginning their association studies. In general, an association study should aim for large datasets, small P values and independent replication if results are to be reliable (Vieland, 2001; Dahlman, 2002). See also the STrengthening the Reporting of OBservational studies in Epidemiology (STROBE) checklists.
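The need for large datasets can be illustrated with an approximate power calculation. The sketch below estimates the power of a two-sided comparison of marker frequency between cases and controls using the normal approximation for two proportions. The function names, the marker frequency of 0.20 and the odds ratio of 2.0 are assumptions chosen for illustration; dedicated power software should be preferred for a real study.

```python
from statistics import NormalDist

def case_frequency(control_freq, odds_ratio):
    """Marker frequency expected in cases for a given odds ratio."""
    return odds_ratio * control_freq / (1 + control_freq * (odds_ratio - 1))

def power_two_proportions(control_freq, odds_ratio, n_cases, n_controls,
                          alpha=0.05):
    """Approximate power of a two-sided test comparing marker frequency
    in cases and controls (normal approximation, upper tail only)."""
    norm = NormalDist()
    p0 = control_freq
    p1 = case_frequency(control_freq, odds_ratio)
    pooled = (p1 * n_cases + p0 * n_controls) / (n_cases + n_controls)
    # Standard errors under the null and alternative hypotheses
    se_null = (pooled * (1 - pooled) * (1 / n_cases + 1 / n_controls)) ** 0.5
    se_alt = (p1 * (1 - p1) / n_cases + p0 * (1 - p0) / n_controls) ** 0.5
    z_crit = norm.inv_cdf(1 - alpha / 2)
    return norm.cdf((abs(p1 - p0) - z_crit * se_null) / se_alt)

# Hypothetical scenario: marker frequency 0.20 in controls, odds ratio 2.0
for n in (100, 250, 500):
    pw = power_two_proportions(0.20, 2.0, n_cases=n, n_controls=n)
    print(f"{n} cases and {n} controls: power = {pw:.2f}")
```

The same function can be called with unequal group sizes to see how much is gained by recruiting more controls than cases.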

Basic statistical checklist for genetic association studies:

- In a case-control study:

Cases and controls derive from the same study base: YES or NO

There are more controls than cases (up to 5-to-1, for increased statistical power): YES or NO

There are at least 100 cases and 100 controls: YES or NO

- Statistical power calculations are presented (at least retrospectively): YES or NO

- Hardy-Weinberg equilibrium (HWE) is checked and appropriate tests are used: YES or NO (for HWE in genetic association studies, see Population Genetics)

- If HWE is violated, no allelic association test is used (because independence assumption is not met): YES or NO

- Possible genotyping errors and counter-measures are discussed: YES or NO

- All statistical tests are two-tailed: YES or NO

- Temptation to extract favorable P values from subgroups (data dredging) was resisted: YES or NO

- Alternative genetic models of association considered: YES or NO

- The choice of marker/allele/genotype frequency (for comparisons) is justified: YES or NO

- For HLA associations, a global test for association (G-test, RxC exact test) for each locus is used (if necessary, with correction for multiple testing): YES or NO

- Chi-squared and Fisher tests are NOT used interchangeably: YES or NO

- P values are presented without spurious accuracy (with two decimal places): YES or NO

- Strength of association has been measured (usually odds ratio and its 95% CI): YES or NO

- In a retrospective case-control study, ORs are presented (as opposed to RRs): YES or NO

- Multiple comparisons issue is handled appropriately (this does not necessarily mean Bonferroni corrections): YES or NO

- Alternative explanations for the observed associations (chance, bias, confounding) are discussed: YES or NO

If any of the answers to the above questions is not YES, think again before submitting your manuscript (or asking a colleague to read it for you).
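For the checklist item on strength of association, the odds ratio and its 95% CI can be obtained from a 2x2 table as sketched below. The sketch uses Woolf's logit method and adds 0.5 to every cell when any cell is zero (Haldane's modification), both of which appear in the reference list; the function name and the counts are hypothetical.

```python
import math
from statistics import NormalDist

def odds_ratio_ci(a, b, c, d, level=0.95):
    """Odds ratio with a Woolf (logit) confidence interval.

    a: cases with the marker      b: cases without the marker
    c: controls with the marker   d: controls without the marker
    Returns (odds ratio, lower limit, upper limit).
    """
    if 0 in (a, b, c, d):
        # Haldane's modification: add 0.5 to every cell
        a, b, c, d = (x + 0.5 for x in (a, b, c, d))
    odds_ratio = (a * d) / (b * c)
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    z = NormalDist().inv_cdf(0.5 + level / 2)
    log_or = math.log(odds_ratio)
    return odds_ratio, math.exp(log_or - z * se_log), math.exp(log_or + z * se_log)

# Hypothetical table: 30 of 100 cases and 15 of 100 controls carry the marker
or_, lower, upper = odds_ratio_ci(30, 70, 15, 85)
print(f"OR = {or_:.2f} (95% CI = {lower:.2f} to {upper:.2f})")
```

An interval that excludes 1 corresponds to statistical significance at the matching level, but the width of the interval conveys the precision of the estimate, which a P value alone does not.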

For population genetics concepts concerning the HLA field, see Basic Population Genetics Notes.
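The HWE check in the checklist can be sketched for the simplest case, a biallelic marker typed in controls, as a chi-squared goodness-of-fit test with one degree of freedom. This is an illustration under stated assumptions: the function name and the counts are hypothetical, and highly polymorphic HLA loci with many alleles and sparse genotype classes generally need exact or permutation-based tests rather than this simple version.

```python
import math

def hwe_chi2(n_aa, n_ab, n_bb):
    """Chi-squared goodness-of-fit test for HWE at a biallelic marker.

    n_aa, n_ab, n_bb: observed genotype counts (for example in controls).
    Returns (chi2, P value) with 1 degree of freedom.
    """
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1 - p
    if p == 0 or q == 0:
        raise ValueError("marker is monomorphic; HWE cannot be tested")
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Upper tail of chi-squared with 1 df equals erfc(sqrt(chi2 / 2))
    return chi2, math.erfc(math.sqrt(chi2 / 2))

# Hypothetical control genotype counts
chi2, p_value = hwe_chi2(298, 489, 213)
print(f"chi2 = {chi2:.3f}, P = {p_value:.3f}")
```

A small P value in controls should first prompt a review of the genotyping, as noted in the list of reasons for inconsistency above.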

References

1. Bland M. An Introduction to Medical Statistics. Oxford: Oxford University Press, 1996.

2. Rothman KJ. No adjustments are needed for multiple comparisons. Epidemiology 1990; 1: 43-46.

3. Bland JM, Altman DG. Multiple significance tests: the Bonferroni method. British Medical Journal 1995; 310: 170-170.

4. Bulpitt CJ. Confidence intervals. Lancet 1987; 1: 494-497.

5. Tiwari JL, Terasaki PI. The data and statistical analysis. In: Tiwari JL, Terasaki PI, eds. HLA and Disease Associations, New York: Springer-Verlag, 1985: 18-22.

6. Altman DG. Practical Statistics for Medical Research. London: Chapman & Hall, 1992.

7. Sham P. Statistics in Human Genetics. London: Arnold, 1998.

8. Svejgaard A, Ryder LP. HLA and disease associations: detecting the strongest association. Tissue Antigens 1994; 43: 18-27.

9. Svejgaard A, Jersild C, Nielsen LS, Bodmer WF. HL-A antigens and disease. Statistical and genetical considerations. Tissue Antigens 1974; 4: 95-105.

10. Thomson G. A review of theoretical aspects of HLA and disease associations [Review]. Theoretical Population Biology 1981; 20: 168-208.

11. Perneger TV. What's wrong with Bonferroni adjustments. BMJ 1998; 316: 1236-1238.

12. Murray GD. Statistical aspects of research methodology. British Journal of Surgery 1991; 78 : 777-781.

13. Mantel N, Haenszel W. Statistical aspects of the analysis of data from retrospective studies of disease. Journal of the National Cancer Institute 1959; 22: 719-748.

14. Klitz W, Aldrich CL, Fildes N, Horning SJ, Begovich AB. Localization of predisposition to Hodgkin disease in the HLA class II region. American Journal of Human Genetics 1994; 54: 497-505.

15. Richardson JTE. The analysis of 2x1 and 2x2 contingency tables: an historical review. Statistical Methods in Medical Research 1994; 3: 107-133.

16. Cochran WG. Some methods for strengthening the common X² tests. Biometrics 1954; 10: 417-451.

17. Rosner B. Fundamentals of Biostatistics. Belmont: Duxbury Press, 1999.

18. Daly LE, Bourke GJ. Interpretation and Uses of Medical Statistics. Oxford: Blackwell Scientific Publications, 2000.

19. Woolf B. On estimating the relation between blood group and disease. Annals of Human Genetics 1955; 19: 251-253.

20. Mantel N. Chi-square test with one degree of freedom: extensions of the Mantel Haenszel procedure. Journal of the American Statistical Association 1963; 58: 690-700.

21. Agresti A. Categorical Data Analysis. New York: John Wiley & Sons, 1990.

22. Kangave D. More enlightenment on the essence of applying Fisher's Exact test when testing for statistical significance using small sample data presented in a 2 x 2 table. West Afr J Med 1992; 11: 179-184.

23. Yates F. Contingency tables involving small numbers and the X² test. Journal of the Royal Statistical Society 1934; Suppl 1: 217-235 (JSTOR-UK)

24. Dyer P, Middleton D. Histocompatibility Testing: A Practical Approach. Oxford: IRL Press, 1993.

25. Dyer P, Warrens A. Appendix: statistical notes. In: Lechler R, ed. HLA and Disease, London: Academic Press, 1994: 113-121.

26. Haviland MG. Yates's correction for continuity and the analysis of 2x2 contingency tables. Statistics in Medicine 1990; 9: 363-383.

27. Conover WJ. Some reasons for not using the Yates continuity correction on 2x2 contingency tables. Journal of the American Statistical Association 1974; 69: 374-376.

28. Dorak MT, Mills KI, Poynton CH, Burnett AK. HLA and Hodgkin's disease [letter]. Leukemia 1996; 10: 1671-1672.

29. Armitage P, Colton T. Encyclopedia of Biostatistics. Chichester: John Wiley & Sons, 1998.

30. Bland JM, Altman DG. One and two sided tests of significance. British Medical Journal 1994; 309: 248-248.

31. Williams K. The failure of Pearson's goodness of fit statistics. Statistician 1976; 25: 49-49.

32. Hoff C. Chi-square trend analysis in antigen sharing studies. Tissue Antigens 1985; 26: 212-213.

33. Armitage P, Berry G. Statistical Methods in Medical Research. Oxford: Blackwell Scientific Publications, 1994.

34. Campbell MJ, Machin D. Medical Statistics. A Commonsense Approach. Chichester: John Wiley & Sons, Ltd, 1999.

35. Moses LE, Emerson JD, Hosseini H. Analyzing data from ordered categories. New England Journal of Medicine 1984; 311: 442-448.

36. Yasuda N, Tsuji K, Itakura K. HLA heterozygosity in children and old people. Tokai Journal of Experimental & Clinical Medicine 1980; 5: 165-169.

37. Glantz SA. Primer of Biostatistics. New York: McGraw Hill, 1997.

38. Sokal RR, Rohlf FJ. New York: W.H. Freeman & Company, 1994.

39. Wilks SS. The likelihood test of independence in contingency tables. Annals of Mathematical Statistics 1935; 6: 190-196.

40. Taylor GM, Gokhale DA, Crowther D, et al. Further investigation of the role of HLA-DPB1 in adult Hodgkin's disease (HD) suggests an influence on susceptibility to different HD subtypes. British Journal of Cancer 1999; 80: 1405-1411.

41. Yates F. Tests of significance for 2x2 contingency tables. Journal of the Royal Statistical Society 1984; A147: 426-463 (JSTOR-UK).

42. Berry G. Statistical significance and confidence intervals. Medical Journal of Australia 1986; 144: 618-619.

43. Gardner MJ, Altman DG. Confidence intervals rather than P values: estimation rather than hypothesis testing. British Medical Journal 1986; 292: 746-750.

44. Haldane JBS. The estimation and significance of the logarithm of a ratio of frequencies. Annals of Human Genetics 1956; 20: 309-311.

45. Zhang J, Yu KF. What's the relative risk? A method of correcting the odds ratio in cohort studies of common outcomes. JAMA 1998; 280: 1690-1691.

46. Zelen M. The analysis of several 2x2 contingency tables. Biometrika 1971; 58: 129-137.

47. Breslow NE, Day NE. Statistical Methods in Cancer Research: The Analysis of Case-Control Studies. Lyon: IARC, 1993.

48. Emerson JD. Combining estimates of the odds ratio: the state of the art. Statistical Methods in Medical Research 1994; 3: 157-178.

49. Mattiuz PL, Ihde D, Piazza A, Ceppellini R, Bodmer WF. New approaches to the population genetic and segregation analysis of the HL-A system. In: Terasaki P, ed. Histocompatibility Testing 1970, Copenhagen: Munksgaard, 1971: 193-205.

50. Schipper RF, D'Amaro J, de Lange P, Schreuder GM, van Rood JJ, Oudshoorn M. Validation of haplotype frequency estimation methods. Human Immunology 1998; 59: 518-523.

51. Yasuda N, Tsuji K. A counting method of maximum likelihood for estimating haplotype frequency in the HL-A system. Japanese Journal of Human Genetics 1975; 20: 1-15.

52. Long JC, Williams RC, Urbanek M. An E-M algorithm and testing strategy for multiple-locus haplotypes. American Journal of Human Genetics 1995; 56: 799-810.

53. Fienberg SE. The analysis of multidimensional contingency tables. Ecology 1970; 51: 419-433.

54. Weir BS, Wilson SR. Log-linear models for linked loci. Biometrics 1986; 42: 665-670.

55. Weir BS. Genetic Data Analysis II. Sunderland: Sinauer Associates, Inc. Publishers, 1996.

56. Kleinbaum DG, Kupper LL, Morgenstern H. Epidemiologic Research. Principles and Quantitative Methods. Belmont, CA: Lifetime Learning, 1982.

57. Hall GH, Round AP. Logistic regression - explanation and use. Journal of the Royal College of Physicians of London 1994; 28: 242-246.

58. Ober C, Hyslop T, Elias S, Weitkamp LR, Hauck WW. Human leukocyte antigen matching and fetal loss: results of a 10-year prospective study. Human Reproduction 1998; 13: 33-38.

59. Wagner U, Kaltenhauser S, Sauer H, et al. HLA markers and prediction of clinical course and outcome in rheumatoid arthritis. Arthritis & Rheumatism 1997; 40: 341-351.

60. Anasetti C, Amos D, Beatty PG, et al. Effect of HLA compatibility on engraftment of bone marrow transplants in patients with leukemia or lymphoma. New England Journal of Medicine 1989; 320: 197-204.

Internet Links

Extensive Epidemiology and Biostatistics Links

ASHI 2001 Biostatistics and Population Genetics Workshop Notes

EFI 2006 Teaching Session Notes

Human Genome Epidemiology Online Book

Analysis of Association, Confounding and Interaction

Experimental Design and Statistical Analysis in Genetic Association Studies

GENESTAT: Genetic Association Studies Portal (see Ripatti, 2009)

Online Statistics

Fisher's and Chi-squared Analysis of a 2x2 Table: (1) (2) (3)

Loglinear Analysis

Logistic Regression

McNemar's Test

Bonferroni Correction

GenePop: Online LD Analysis & Online HWE Analysis

DeFinetti for Online HWE Analysis for SNP Data