
 

STATISTICAL ANALYSIS IN

HLA AND DISEASE ASSOCIATION STUDIES

M.Tevfik Dorak

 

See also 'Common Concepts in Statistics' and 'Pitfalls in Genetic Association Studies' {PPT}

Case-Control Association Studies by C Lewis (EGF Summer School Notes)

The Lancet Septet on Genetic Epidemiology: Editorial, ( I ), ( II ), ( III )*, ( IV ), ( V )**, ( VI ), ( VII ) 

*  Genetic Association Studies (Cordell & Clayton, 2005)

**   What Makes a Good Association Study (Hattersley & McCarthy, 2005)

Nature Reviews Genetics Web Focus: Statistical Analysis: Editorial, ( I )***, ( II ), ( III ), ( IV )

***   Tutorial on Statistical Analysis of Population Association Studies (DJ Balding, 2006)

Population-Based Association Studies Lecture by David Clayton 

Wellcome Trust Centre for Human Genetics: Course on the Design and Analysis of Disease-Marker Association Studies

Nature Medicine: Editorial on Statistical Analysis & Statistical Guidelines - Checklist  

'Reporting, Appraising, and Integrating Data on Genotype Prevalence and Gene-Disease Associations (2002)'

'Bias and Confounding in Molecular Epidemiologic Studies by Vineis & McMichael (1998)'

'Statistical Approach to Molecular Epidemiology Studies & Editorial, JNCI (2004)'

'Human Genome Epidemiology Online Book (2004)'

 

Earlier versions of these notes formed the basis of the Workshop at BSHI 2002 Meeting 

 

What is an association study?

Linkage and association studies are the two major types of investigations to determine the contribution of genes to disease susceptibility (or any other phenotype). While linkage studies can only use family data, association studies can be family or population based. In addition to having wider applications, association studies are considered to be more sensitive (having greater power) than linkage methods in a comparably sized study. Allelic association is useful to refine the location of major genes prior to positional cloning. This has been the main use of association studies: to map the unknown disease gene, which is presumed to be in LD with the associated gene, and to proceed to positional cloning of the disease gene (see Genetic Epidemiology). With the information obtained from the Human Genome Project, the focus of association studies is changing to examining the role played by recently discovered genes in disease development (Burke, 2003).

 

HLA genes have been known for a long time and the whole of the HLA complex has been sequenced (The MHC Sequencing Consortium, 1999; Allcock, 2002; Shiina, 2004; Horton, 2004), so why are we still doing HLA association studies? One of the many reasons is to identify disease-specific susceptibility (risk) and protective markers that can be used in immunogenetic profiling, risk assessment and therapeutic decisions. Perhaps a more important reason is to refine already known or constantly emerging associations in the light of expanded knowledge of the HLA genetic map. In the beginning, only the 'classic' HLA genes and a few class III genes were known and could be included even in the most comprehensive association studies. The latest and perhaps ultimate map of the HLA complex contains more than 100 functional genes and the task is now to figure out the real association behind HLA associations. Several examples have appeared recently showing that some HLA associations, however strong, are due to non-HLA genes. One has to remember that hemochromatosis and congenital adrenal hyperplasia were originally found to be associated with HLA antigens, not with HFE or CYP21A2, the genes later shown to be responsible. Therefore, from a basic science point of view, the ultimate goal of HLA association studies is to find out how genes cause the disease or modify susceptibility to it or its course. In the modern era of HLA association studies, the conceptualization, design, execution and interpretation have gained new dimensions. Hypothesis-driven studies are replacing fishing expeditions; the extreme polymorphism of HLA genes forces researchers to take statistical power into consideration in sample size determination more frequently, and to consider replication more than ever to avoid the well-recognized complications of multiple comparisons. Ever increasing global ethnic diversity, on the other hand, requires extraordinary care to prevent spurious associations resulting from population stratification.
A large number of sophisticated methods have been developed by biostatisticians and statistical geneticists to handle the data generated from comprehensive association studies, especially in diseases with a complex genetic basis (Lewis, 2002; Forabosco, 2005; Cordell, 2005). This review is on the basic principles of traditional HLA association studies in terms of statistical interpretation, with the ultimate aim of providing an insight into the modern methods of association data analysis.

 

 

General considerations and the multiple comparisons issue

The most common ways to make an inference about the population from which the analyzed sample has been drawn are significance probability calculation and confidence interval (CI) estimation. A CI conveys the result of a significance test, but a significance test cannot provide the confidence limits. For a ratio measure such as the RR or OR, the statistical test is essentially a test of whether the confidence interval includes unity.

Each experiment starts with a null hypothesis which simply states that there is no difference between the two samples; that is, 'two samples have been drawn from the same population.' A test statistic generally relies on comparing the observed value with the distribution of what is expected if the null hypothesis is true. If the null hypothesis is rejected by statistical testing, the alternative hypothesis (the effect is not zero; there is a difference, etc.) is accepted. The general form of a test statistic is as follows:

test statistic = (observed value - hypothesized value) / standard error of the observed value

The choice of the test used to calculate this statistic depends on what is being compared. The usual tests used in HLA association studies and also in population genetic analysis of the HLA system are briefly described later. A test statistic yields the probability of observing the value we found (or an even more extreme value) when the null hypothesis is true. If the test gives a P (probability of error) value of <0.05, this means that the observed value lies somewhere in the extreme 5% of the relevant distribution curve for the data, which implies an unusual finding.

In significance testing, a type I error (a false-positive finding) is considered to be more serious, and therefore more important to avoid, than a type II error (missing a positive finding). If 0.05 is the pre-determined level of statistical significance, this amounts to reaching one incorrect conclusion (a false positive) with no biological relevance in twenty tests 1. The value 0.05 is the probability of getting an erroneous result each time one compares two groups that are in fact equivalent. This level of error can be accepted in the context of a single test of a prespecified hypothesis. The probability of getting at least one statistically significant result as a result of type I error in multiple comparisons is, however, much higher than this. This probability can be calculated with the formula $1 - (1 - \alpha)^n$, where n is the number of comparisons and α is the arbitrarily chosen significance level, which corresponds to the accepted failure rate for each comparison 1-4. If the probability of getting the right result is 0.95 (P = 0.05) in a single test, it would be $(0.95)^{20} = 0.36$ in 20 tests. Thus, the probability of getting it wrong is 1 - 0.36 = 0.64. This is to say that, for 20 comparisons and the significance level of 0.05, the probability of getting at least one erroneous result is 0.64 (which is 0.92 for 50 comparisons). This probability is smaller when the significance level is lower (0.18 for 20 comparisons and the significance level of 0.01). This argument applies to situations when the multiple comparison tests are independent tests.
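The figures quoted above can be reproduced with a few lines of Python. This is only a sketch of the formula for independent tests; the function name is chosen here for illustration.

```python
def familywise_error_rate(alpha: float, n_tests: int) -> float:
    """Chance of at least one false positive among n_tests independent tests."""
    return 1.0 - (1.0 - alpha) ** n_tests


# The three scenarios quoted in the text.
for alpha, n_tests in [(0.05, 20), (0.05, 50), (0.01, 20)]:
    rate = familywise_error_rate(alpha, n_tests)
    print(f"alpha={alpha}, tests={n_tests}: {rate:.2f}")
```

The three printed values are 0.64, 0.92 and 0.18, in agreement with the text.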

To avoid a type I error, when independent multiple comparisons are carried out, and all genotypes examined have the same chance of being increased or decreased, a statistical safeguard should be applied. There is no definite limit for the number of comparisons that makes this necessary but it is a must if it is greater than 20. This is usually done by multiplying the P value by the number of comparisons (Bonferroni inequality method) 3;5;6. This is basically lowering the statistical significance level and making the statistical test more conservative. For small numbers of comparisons (say up to five) its use is considered to be reasonable, but for larger numbers and for the multiple tests that are not independent (highly correlated), it is highly conservative 1;6 (see also Bland & Altman, 1995). The precise correction is obtained by the formula:

$P_{corrected} = 1 - (1 - P)^n$

where P is the uncorrected P value, and n is the number of comparisons 7. The number to use for correction is not (as frequently done) the number of alleles or genotypes detected in the study but the number of comparisons one or more of which shows a significant result 8 (see below). For P values of less than 0.01, this formula gives almost the same result as simple multiplication 8. If the 'corrected' P value is still less than the pre-determined significance level (such as 0.05), then the result is significant.
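As an illustration, the simple multiplication and the precise formula can be compared side by side. The sketch below assumes independent comparisons; the function names and the example values (P = 0.004, 12 comparisons) are hypothetical.

```python
def bonferroni_corrected(p: float, n_comparisons: int) -> float:
    """Simple multiplication of the P value, capped at 1."""
    return min(1.0, p * n_comparisons)


def precisely_corrected(p: float, n_comparisons: int) -> float:
    """Precise correction: 1 - (1 - P)^n."""
    return 1.0 - (1.0 - p) ** n_comparisons


p_uncorrected, n_comparisons = 0.004, 12  # hypothetical values
print("multiplication:", bonferroni_corrected(p_uncorrected, n_comparisons))
print("precise       :", precisely_corrected(p_uncorrected, n_comparisons))
```

The precise value never exceeds the product P x n, and for small P values the two are nearly identical, which is why the simple multiplication is usually regarded as adequate.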

There has always been a debate about what number to use in the multiplication. In contrast to the most common practice, this is not supposed to be just the number of alleles in the locus analyzed 9. If the allele frequencies are compared between patients and controls, the number of alleles is important as well as the number of comparisons in terms of age groups, sex groups, clinical subgroups, etc. When two antigens in linkage disequilibrium are investigated in an association study, the correction issue becomes slightly more complicated. Svejgaard & Ryder 8 have discussed this problem. When two loci without linkage disequilibrium between them are studied, Svejgaard recommends considering them separately in multiplication 9.

Another way to correct for the bias due to multiple comparisons is to do a second study to check the frequency of the same specificity in the same disease 5;9;10. If the uncorrected P value is ≤ 0.05, as in the first study, then the second study has confirmed the results of the preliminary study. The second study is said to be done with a specific hypothesis; therefore, an expected association would not be regarded as an artifact of multiple comparisons.

It has been argued, however, that by using the multiplication rule to avoid type I errors, the risk for type II errors increases 2;11. This means that some genuine finding may be ruled out as a chance finding where it is worth pursuing. This is a good point since in association studies in a disease with multifactorial etiology, there will rarely be a very significant association when a single individual factor is considered. Any significant association, regardless of the corrected P value, should be critically evaluated and pursued further if biologically plausible. As mentioned above, a second study may offer the best explanation to these kinds of marginal findings.

Murray has proposed another method for the design of the study, which would preserve the sensitivity while avoiding a high risk of type I error 12. This method may be most useful for HLA association studies. It consists of setting down a priori a hierarchy of comparisons of interest. There should be a single comparison of primary interest, upon which the sample size (power) calculation would be based, and the analysis of which would be taken at face value. Thus the sensitivity is preserved for the most important question, since one only has to achieve P < 0.05 in order to claim significance. There would then be a limited number of prespecified secondary comparisons, which would carry a lesser weight. If one of these is significant after the correction for multiple comparisons, however, it cannot be lightly dismissed. After that, any other comparisons of interest would be made in a purely exploratory fashion, in an attempt to generate hypotheses for future studies. Even the most extreme P values emerging in these final comparisons should be given very little weight until there is independent supporting evidence from other studies (a second study based on a specific hypothesis).

In their discussion of the statistical analysis of retrospective studies 13, Mantel and Haenszel state that "if the purpose of the retrospective study is to uncover leads for fuller investigation, it becomes clear there is no real multiple significance testing problem. A single retrospective study does not yield conclusions, only leads." This situation corresponds to a full HLA locus analysis in a disease with no a priori hypothesis for the nature of an association. This effort would then be called a hypothesis generation approach. Mantel and Haenszel go on to say "also, the problem does not exist when several retrospective and other type studies are at hand, since the inferences will be based on a collation of evidence, the degree of agreement and reproducibility among studies, and their consistency with other types of available evidence, and not on the findings of a single study." In the HLA field, this is the case when there are several independent studies on the same association in the same disease or when there is a second study performed to test a specific hypothesis.

A specific example of the application of Murray's proposition, in the light of Mantel and Haenszel's view, to the HLA field would be as follows: a study can be designed to investigate the relevance of homozygosity for an HLA class II supertype in childhood leukaemias as the same genotype has been found to be increased in two adult leukaemias. Therefore, the specific a priori hypothesis is that homozygosity for this supertype is increased in childhood leukaemias. The comparison of its frequency between patients and controls would not require any statistical safeguard for multiple comparisons. If an increase is shown, then one would like to check the individual members of this supertypic family to see whether this increase is due to an increase in any of them. These comparisons would require the correction for multiple comparisons before any increase is declared to be significant. The third step would be to analyze all HLA-DR types since the data are available anyway. If any of the other HLA-DR types is found to be increased or decreased, caution should be taken in making a big event out of this. Such a finding should be subject to confirmation with a second study unless, of course, the P value can already stand multiplication by the number of all these comparisons made.

More recently, Klitz suggested that if a G-test is applied to the overall distribution of genotype frequencies between patients and controls and if this yields a significant result, the individual associations cannot be taken as the result of multiple comparisons 14. He reminds us that looking for an association using individual 2x2 tables for each antigen is an old method that stemmed from the presence of blank alleles at the time of serological typing. This would certainly lead to the multiple comparisons problem. Nowadays, blank alleles are not a problem anymore and a G-test (or χ²-test) on a 2xN table, where N is the number of alleles/haplotypes or genotypes, is a sensible approach. A significant deviation in this analysis justifies further scrutiny of the data to single out the main association. Klitz also suggests grouping of very rare alleles in one category to increase the sensitivity of the test.
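A minimal sketch of an overall G-test on a 2xN table is given below. It uses the usual likelihood-ratio form G = 2 Σ O ln(O/E) with N-1 degrees of freedom; the function name and the allele counts are hypothetical, and the P value would still have to be read from the Chi-squared distribution with the returned degrees of freedom.

```python
import math


def g_test_2xn(case_counts, control_counts):
    """Return (G, degrees of freedom) for a 2xN table of counts."""
    if len(case_counts) != len(control_counts):
        raise ValueError("both rows need the same number of categories")
    n_cases, n_controls = sum(case_counts), sum(control_counts)
    total = n_cases + n_controls
    g = 0.0
    for a, c in zip(case_counts, control_counts):
        column_total = a + c
        for observed, row_total in ((a, n_cases), (c, n_controls)):
            if observed > 0:  # a zero cell contributes nothing to the sum
                expected = row_total * column_total / total
                g += 2.0 * observed * math.log(observed / expected)
    return g, len(case_counts) - 1


# Hypothetical allele counts in patients and controls (four categories).
g, df = g_test_2xn([30, 25, 10, 35], [20, 30, 25, 25])
print(f"G = {g:.2f} with {df} degrees of freedom")
```

Very rare alleles can be pooled into a single category before calling the function, in line with Klitz's suggestion.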

Confusion may occasionally arise through wrong usage of the terms allele, gene, or marker in an association study. Some investigators state that they compare allele or haplotype frequencies, but only count each individual once. They, therefore, refer to what used to be phenotype frequencies in serological HLA studies, or in the case of genotyping studies, to marker frequencies (MF), which correspond to inferred phenotype frequencies if it is an expressed genotype. Allele (AF) or haplotype frequency (HF) is analogous to gene frequency (GF) in that they are always calculated in terms of the total number of chromosomes not individuals.
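The distinction can be made concrete with a small sketch. The genotype list is hypothetical and the function names are chosen here for illustration: the marker frequency counts individuals, whereas the allele frequency counts chromosomes.

```python
def marker_frequency(genotypes, allele):
    """Proportion of individuals carrying the allele at least once."""
    carriers = sum(1 for genotype in genotypes if allele in genotype)
    return carriers / len(genotypes)


def allele_frequency(genotypes, allele):
    """Proportion of chromosomes (two per individual) carrying the allele."""
    copies = sum(genotype.count(allele) for genotype in genotypes)
    return copies / (2 * len(genotypes))


# Four hypothetical individuals typed at one locus.
genotypes = [("DR3", "DR3"), ("DR3", "DR4"), ("DR4", "DR7"), ("DR1", "DR7")]
print(marker_frequency(genotypes, "DR3"))  # 2 carriers out of 4 individuals
print(allele_frequency(genotypes, "DR3"))  # 3 copies out of 8 chromosomes
```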

For a comprehensive discussion of the multiple comparisons issue, see Intuitive Statistics: Multiple Comparisons.

Case-control design and related issues

 

Population stratification can be thought of as confounding by ethnicity. If the ethnicity of cases and controls is reliably known, a stratified analysis would eliminate this problem. However, it is the unknown stratification within the population that causes this undesirable effect. To overcome this problem, the programs Structure and Strat designed by Pritchard, or the genomic control approach, can be used (Devlin, 1999; Pritchard, 1999).

 

In case-control studies, ideally there should be at least one control per case. If the number of cases is limited and cannot be increased easily, it may be worth increasing the number of controls to increase statistical confidence. The law of diminishing returns, however, dictates that a maximum of five controls per case is the limit (Epidemiology for the Uninitiated by Coggon, Rose and Barker. BMJ Publishing Group, 1997). A higher control to case ratio will not provide further benefit and may even result in type I errors.

 

The following quality control questions for clinical case-control studies also apply to genetic association studies:

1. Were the groups similar apart from the exposure under question?

2. Were the data on outcomes collected identically in both groups?

3. Is there a dose-response effect?

4. Does the relationship make biological and chronological sense?

5. How strong is the association and how precise is the estimate?

 

Power calculations and expectation should be realistic in a genetic association study with a complex (multifactorial) disease. Genetic susceptibility to such diseases involves a large number of alleles, each conferring only a small genotypic risk (like OR = 1.2 to 2.0), that combine additively or multiplicatively to confer a range of susceptibilities in interaction with environmental factors. Apart from increasing the numbers of cases and controls to have greater power and validity, hypernormal controls or genetically severe cases (like cases selected for a family history of disease) can be used. Other strategies to increase statistical power in an association study are the use of haplotypes rather than alleles and even an ancestral haplotype approach in a cladistic haplotype analysis. Underpowered studies are one of the main reasons for failure to replicate an initial association study. 

Significance testing

Constructing a 2x2 contingency table

When two groups are to be compared in a case-control study, it is necessary to have a 2x2 contingency table cross-tabulating the frequencies. This table is required for significance testing, relative risk (RR) or odds ratio (OR) estimation and CI calculation. A contingency table may have more than two rows and columns but the 2x2 table approach stems from classic case-control studies in epidemiology as the most elemental data structure leading to ideas of association 15. The current trend in genetic association analysis is to consider genotypes as the genetic factor, or alleles if a multiplicative risk model is appropriate (Sasieni, 1997; Lewis, 2002; Cordell, 2005), see Genetic Models. For a multiallelic locus like the HLA loci, however, the number of genotype categories is large. This is why comparisons based on the presence or absence of an allele were originally thought to be more parsimonious and were established as the standard approach for HLA association studies 5. Another reason for using individual 2x2 tables for each antigen in HLA association studies was the presence of blank alleles at the time of serological typing (which is no longer the case) 14. This contingency table is one of several ways of analyzing genetic association study data and does not take into account the underlying genetic model (recessive, dominant, codominant, additive, multiplicative) (Lewis, 2002; Sellers, 2004; Cordell, 2005). HLA association tests relying on the presence of an allele in cases vs controls implicitly use a dominant genetic model (carrying at least one copy of the allele is all that matters) but, in compliance with current trends, all genetic models should be explored if the genetic model for susceptibility is unknown (Lewis, 2002; Sellers, 2004; for online genetic model-based analysis: DeFinetti; MODEL).

A general layout of a contingency table for a conventional HLA-disease association study is as follows:

 

 

 

                             allele i
                       Present     Absent     Row totals
Outcome
   Patients               a           b          a+b
   Controls               c           d          c+d
Column totals            a+c         b+d       N=a+b+c+d


The number in each cell is a count, i.e., a non-negative integer. Each subject must appear in one, and only in one, cell. In the table, switching rows and columns will not alter the result. When a contingency table is subjected to hypothesis testing, the null hypothesis is that there is no association between the two variables, the alternative being that there is an association of any kind (two-sided test). To do the hypothesis test, it is generally necessary to calculate the expected frequencies for each cell (see below). If the Chi-squared test is to be used, all expected frequencies should be at least 1 and at least 80% of them should be more than 5 (Cochran rule) 16. This means that for a 2x2 table, all expected frequencies should be more than 5. As will be discussed below, the best approach is to use an exact test which is not based on approximation or assumptions and does not require any kind of correction. Such tests used to be computationally highly demanding but this is no longer the case.
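The expected frequencies and the Cochran rule can be checked with a short sketch such as the following. The table is passed as rows of counts (patients first, then controls); the function names are illustrative, not part of any package.

```python
def expected_counts(table):
    """Expected cell counts: row total x column total / grand total."""
    row_totals = [sum(row) for row in table]
    column_totals = [sum(column) for column in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in column_totals] for r in row_totals]


def satisfies_cochran_rule(table):
    """All expected counts at least 1 and at least 80% of them above 5."""
    flat = [e for row in expected_counts(table) for e in row]
    enough_large_cells = sum(e > 5 for e in flat) >= 0.8 * len(flat)
    return min(flat) >= 1 and enough_large_cells


print(expected_counts([[30, 70], [40, 60]]))
print(satisfies_cochran_rule([[30, 70], [40, 60]]))
print(satisfies_cochran_rule([[2, 98], [0, 300]]))  # expected count for cell a is 0.5
```

In a 2x2 table a single expected count of 5 or less already means that only 75% of the cells exceed 5, which is why the rule requires all four to be more than 5.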

Choice of a statistical test

The usual choice for testing the significance of an HLA-disease association is Fisher’s exact test 9. This test is described below. Several other tests based on approximations can also be applied for the same analysis 15;17. These are the ordinary Chi-squared test (with or without Yates's correction) 18, the two-sample Z-test 6;18, Woolf's method 19, and the Mantel-Haenszel χ² test 13;20;21. When one or more of the expected numbers in the 2x2 table is less than five, and when the overall sample size is small, Fisher’s test is commonly believed to be the only reliable one 9;17;22, although there are also strong arguments against this 15.

Yates's correction for discontinuity

If the χ² test is used in the statistical analysis of the results, it may be necessary to use the Yates's continuity correction for small samples 15;17;23. The Chi-squared distribution is continuous. That means that the curve of the χ² distribution model is continuous without any breaks. The values we calculate for χ² are, however, discrete values. This is because observed frequencies vary in discrete units (the number of occurrences of a gene, i.e., the entries in the 2x2 table, may be 4 or 5 but not 4.6). With degrees of freedom greater than 1 and with expected frequencies of at least 5 in each cell, this is not a problem as the difference between the statistic and the true sampling distribution is so small. For example, the difference between 100 and 101 is negligible (1%) compared to the difference between 4 and 5 (25%). Any difference between observed and expected frequencies will appear large when cell frequencies are small and may result in a type I error.

The Yates's correction helps make the discrete data generated by the test statistic [$\sum (O-E)^2/E$] more closely approximate the continuous Chi-squared distribution. This is achieved by changing the above formula to [$\sum (|O-E| - 0.5)^2/E$]. By doing so, the discrete data distribution and continuous data distribution are approximated better. This will result in a smaller calculated value of χ² and will reduce the risk of a type I error. In relatively large samples, this would not make an important difference. Most textbooks recommend that the Yates's continuity correction should be applied when the sample size is small or the contingency table contains any number less than 10 24;25. This correction, however, tends to overcompensate for discontinuity and may result in a more conservative decision than necessary 15;26. As a simple rule, if a result is still significant after correction, or already non-significant without it, there is no problem. When a significant result becomes non-significant with the correction, this might be due to overcompensation. It is argued that the Yates’s correction produces severe conservative bias in the χ² test for association (χ² test for independence or homogeneity) where the observed marginal frequencies in the contingency table are subject to sampling variability 15;26;27. Some statisticians strongly argue that the Yates’s correction is only necessary when the experimenter fixed both sets of marginal totals and when the cell frequencies are small 15;26. For other types of contingency tables, Yates’s correction may indeed be too conservative but the current tendency is stronger towards its usage 1.
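The effect of the correction can be seen in a small sketch that computes the Chi-squared statistic for a 2x2 table with and without it. The counts are hypothetical, and clamping the corrected difference at zero (when |O-E| is below 0.5) is a safeguard added here, not something stated in the text.

```python
def chi_squared_2x2(a, b, c, d, yates=False):
    """Sum of (O-E)^2/E over the four cells, optionally with Yates's correction."""
    n = a + b + c + d
    observed = [a, b, c, d]
    expected = [(a + b) * (a + c) / n, (a + b) * (b + d) / n,
                (c + d) * (a + c) / n, (c + d) * (b + d) / n]
    statistic = 0.0
    for o, e in zip(observed, expected):
        difference = abs(o - e)
        if yates:
            difference = max(0.0, difference - 0.5)  # assumption: never below zero
        statistic += difference ** 2 / e
    return statistic


# Hypothetical table: 12 of 20 patients and 5 of 20 controls carry the allele.
print(round(chi_squared_2x2(12, 8, 5, 15), 2))
print(round(chi_squared_2x2(12, 8, 5, 15, yates=True), 2))
```

With these counts the uncorrected statistic lies above the 5% critical value for one degree of freedom (3.84) and the corrected one lies below it, which is exactly the situation in which overcompensation has to be considered.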

Significance level

In the study of biological variations, a P value of less than 0.05 is generally considered to be statistically significant. In HLA association studies, because of the nature of the HLA system and random variations in gene frequencies, it is not infrequent that a P value of this magnitude is obtained. When the significance level is chosen as P = 0.05, most reported associations cannot be confirmed in subsequent studies 8. It has been suggested that a P value of less than 0.01 should be used for the significance of an HLA association, and the description 'highly significant' should be reserved for P values less than 0.001 (Ref. 8).

Occasionally, a relative risk (RR) or odds ratio (OR) is quoted as a result of comparison between patients and controls without a P value or CI. An RR may be well above 1.0, especially in small groups, but without statistical significance it is meaningless. If a CI is calculated for such an RR (i.e., one not associated with P ≤ 0.05), it would be a very wide one extending to both sides of 1.0. In one report, a RR similar to the one reported for an HLA association in Hodgkin's disease can be found even for comparisons between two control groups of the same study 28. An RR/OR should always be reported together with its statistical interpretation and CI.
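As a sketch of why an impressive point estimate can be meaningless, the following computes an odds ratio with an approximate 95% CI on the logarithmic scale (the logit method usually attributed to Woolf). The counts are hypothetical, all four cells must be non-zero, and the function name is illustrative.

```python
import math


def odds_ratio_with_ci(a, b, c, d, z=1.96):
    """Odds ratio ad/bc with an approximate CI computed on the log scale."""
    odds_ratio = (a * d) / (b * c)
    standard_error = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * standard_error)
    upper = math.exp(math.log(odds_ratio) + z * standard_error)
    return odds_ratio, lower, upper


# Hypothetical small study: 6 of 20 patients and 3 of 20 controls carry the allele.
estimate, lower, upper = odds_ratio_with_ci(6, 14, 3, 17)
print(f"OR = {estimate:.2f}, 95% CI {lower:.2f} to {upper:.2f}")
```

In this example the odds ratio is above 2, yet the interval extends to both sides of 1.0, so the estimate on its own says very little.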

The obtained P value (or significance probability) should be interpreted properly. Statistics does not prove the truth of anything; it just provides more or less evidence for its validity. The significance probability describes how compatible the data are with the null hypothesis: if the statistical experiment were to be repeated on many subsequent occasions and if the null hypothesis were true, the P value represents the proportion of those experiments that would yield a result at least as extreme as the one observed 29. More technically, it is the probability of obtaining a value of the test statistic as large as or larger than the one computed from the data when in reality there is no difference between the proportions. Basically, a small P value indicates that the observed data would be unusual if there were no true difference. If this probability is less than 5%, we usually reject the null hypothesis. Just as P ≤ 0.05 does not prove anything, P > 0.05 does not necessarily mean the null hypothesis is correct. These may be due to type I and type II errors, respectively. In any case, a significant result only adds weight to the alternative hypothesis. A negative result should be evaluated taking into account the power of the test. If a genotype occurs in two patients (n = 100), its total absence in 300 controls would yield a P value of 0.06 (Fisher’s exact test) with a RR of 15.3 (Woolf-Haldane analysis). This result cannot be easily dismissed as non-significant.
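The example of two positive patients out of 100 against none of 300 controls can be checked with a short sketch. The two-sided Fisher P value below sums the probabilities of all tables no more likely than the observed one, which is one of several conventions; the odds ratio uses Haldane's modification of adding 0.5 to every cell. Function names are illustrative.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher P: sum of table probabilities not exceeding the observed one."""
    row1, row2, col1 = a + b, c + d, a + c
    total = row1 + row2

    def probability(x):
        # Hypergeometric probability of x in cell a with all margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / comb(total, col1)

    observed = probability(a)
    smallest, largest = max(0, col1 - row2), min(row1, col1)
    return sum(probability(x) for x in range(smallest, largest + 1)
               if probability(x) <= observed * (1 + 1e-9))


def haldane_odds_ratio(a, b, c, d):
    """Odds ratio after adding 0.5 to each cell (usable when a cell is zero)."""
    return ((a + 0.5) * (d + 0.5)) / ((b + 0.5) * (c + 0.5))


print(round(fisher_exact_two_sided(2, 98, 0, 300), 2))  # 0.06
print(round(haldane_odds_ratio(2, 98, 0, 300), 1))      # 15.3
```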

One-sided or two-sided test?

Some comparisons, like those between two means or two proportions, can be evaluated by one-sided or two-sided P values (all comparisons of three or more groups are two-sided) 30. When a significance level is chosen, say P = 0.05, we consider the difference between the two values (p1 and p2) significant if it is so unusual that, were the sampling from the population repeated 100 times under the null hypothesis, the difference would be smaller than the observed one in 95 of these. But we do not state the direction of the difference (both p2-p1 and p1-p2 are of interest). If we are evaluating a difference in either direction, a two-sided P value is required. If, however, the direction of the difference is stated in the hypothesis, for example, if the alternative hypothesis is that there is a difference and p1 is greater than p2, the two proportions can be evaluated by a one-sided P value. However, Bland & Altman 30 state that "in general, a one-sided test is appropriate when a large difference in one direction would lead to the same action as no difference at all. Expectation of a difference in a particular direction is not adequate justification". Inappropriate use of the one-sided test would double the risk of a spurious significant difference. Two-sided tests should always be preferred unless there is a very good reason for doing otherwise.

Technically, the two-sided P value corresponds to the total area in both ends (tails) of the distribution curve for symmetrical distributions. If a one-sided P value is going to be used, it will correspond to the area at one tail of the curve and it will be half the value of the two-sided P value. A χ²-test with 1 d.f. is essentially a two-sided test, whereas Fisher's test usually gives a one-sided P value and there are several ways to calculate the two-sided P value.

Degrees of freedom

The number of degrees of freedom is a way of referring to the number of independent variables involved. This is usually one less than the total number of variables. In a 2x2 table containing the number of patients and controls for the presence or absence of an allele or genotype, the degrees of freedom is one. This is because the number of independent variables is one. If anybody has an allele, that person cannot also be negative for it. Similarly, a person is either a patient or a control. The general rule to calculate the degrees of freedom is to multiply the degrees of freedom for the rows and columns [i.e., d.f. = (r-1)(c-1)] 1;25.

In the study of a bi-allelic locus, despite having three possible genotypes (say, AA, Aa, aa) the degrees of freedom is still one. These three genotypes are not independent of one another. Given the gene frequency of either A or a, the other one's frequency, and subsequently all genotype frequencies can be calculated. Thus, the number of independent variables is one. If a study is investigating the association of five different independent phenotypic markers in patients and controls (two groups), the degrees of freedom will be (5-1)(2-1) = 4. The 'degrees of freedom' is taken into account in the translation of the χ² value to a P value. This is because the χ² distribution is different for each number of degrees of freedom.

Significance tests

For the most common situation in the HLA field (the comparison of two independent proportions, p1 and p2), several equivalent hypothesis tests can be used: the Chi-squared test, the two-sample Z-test, the G-test, or Fisher’s exact test 6;17 (for a link to Online Analysis of a 2x2 Table, see the end of this document).
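A minimal sketch of the two-sample Z-test for proportions is shown below. It assumes the pooled-proportion form of the standard error and obtains the two-sided P value from the standard normal distribution; the counts and the function name are hypothetical.

```python
import math


def two_proportion_z_test(x1, n1, x2, n2):
    """Return (Z, two-sided P) for the difference between two proportions."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    standard_error = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / standard_error
    # Area in both tails of the standard normal curve beyond |Z|.
    p_two_sided = math.erfc(abs(z) / math.sqrt(2))
    return z, p_two_sided


# Hypothetical data: allele present in 12 of 20 patients and 5 of 20 controls.
z, p_value = two_proportion_z_test(12, 20, 5, 20)
print(f"Z = {z:.3f}, Z squared = {z * z:.3f}, two-sided P = {p_value:.4f}")
```

The square of Z equals the uncorrected Chi-squared statistic for the same 2x2 table, which is the sense in which these tests are equivalent; the one-sided P value is half the two-sided one.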

Chi-squared test

To calculate the χ² for the same data, a 2x2 table is first constructed, actual numbers of occurrences are placed, and the expected frequencies in each cell are calculated. The expected frequency in a cell is the product of the relevant row and column totals divided by the sample size (grand total; N = a+b+c+d). For the cell with observed frequency 'a', for example, the expected value is (a+b)(a+c)/N. The difference between observed (O) and expected (E) values (residual) is the same for each cell but with different signs (- or +). This means that there is only one independent observation rather than four, so just one degree of freedom. The difference between O and E for each cell (O-E) is then calculated. The χ² is calculated as the sum of $(O-E)^2/E$ for all four cells:

$\chi^2 = \sum [(O-E)^2 / E]$

Intuitively, as the expected frequencies are calculated for each cell, our entries should be observed frequencies (there are other tests to compare an observed proportion with an expected one). The Chi-squared test relies on a normal approximation to the distribution of the cell counts, and this approximation may be poor for small sample sizes.

Alternatively, a different version of the χ² formula can be used 6;24. This version uses the observed frequencies in cells and avoids the need to calculate the expected values explicitly:

χ² = [N (ad-bc)²] / [(a+b) (a+c) (b+d) (c+d)]

This formula for χ² for a 2x2 table is mathematically identical to the general formula for χ² but can only be justified for large samples 15. As for the Z-test, the P value for the χ² test is obtained from tables.
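The two forms of the χ² calculation can be compared with a short Python sketch. The function names and the example counts are illustrative assumptions, not part of the original notes; the table layout assumed here is a, b in the first row (patients) and c, d in the second row (controls). Instead of printed tables, the P value for one degree of freedom is taken from the normal tail using the standard library.

```python
import math

def chi2_general(a, b, c, d):
    """Chi-squared for a 2x2 table as the sum of (O-E)^2/E over the four cells."""
    n = a + b + c + d
    observed = [a, b, c, d]
    row_totals = [a + b, a + b, c + d, c + d]
    col_totals = [a + c, b + d, a + c, b + d]
    # Expected count in a cell = row total * column total / grand total
    expected = [r * k / n for r, k in zip(row_totals, col_totals)]
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def chi2_shortcut(a, b, c, d):
    """Shortcut form N(ad-bc)^2 / [(a+b)(a+c)(b+d)(c+d)]."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (a + c) * (b + d) * (c + d))

def p_value_1df(chi2):
    """Upper-tail P value of chi-squared with 1 d.f. (equals the two-sided normal tail)."""
    return math.erfc(math.sqrt(chi2 / 2))

# Hypothetical counts: 30 of 100 patients and 15 of 100 controls carry the allele
x2 = chi2_shortcut(30, 70, 15, 85)
print(x2, chi2_general(30, 70, 15, 85), p_value_1df(x2))
```

Both functions should agree to within rounding error, which is the sense in which the shortcut is mathematically identical to the general formula.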

There is one well-justified argument against the use of the Chi-squared test. Williams reminds us that Pearson originally developed the goodness of fit test as a computationally convenient approximation of the maximum likelihood statistic or the G-test 31. It can be shown mathematically that the approximation is valid if the deviation of observed from expected frequencies is smaller than the expected frequency (|O-E| < E). Thus, if |O-E| > E in even one cell, the Chi-squared test should not be used. The Chi-squared distribution is usually a poor approximation for the test statistic G² when N / rc is smaller than five 21. Williams also reminds us that it is now computationally easy to calculate maximum likelihood statistics and there seems to be no reason for the continued use of Pearson's approximation.

A common mistake in the analysis of HLA sharing data is to use the ordinary χ² test when the χ² trend test is more appropriate 32. When there are more than two categories (number of antigens shared in two loci is 1 to 4) compared between two groups (those having recurrent spontaneous miscarriages and fertile couples), and there is an ordered increasing or decreasing difference along the N categories, the appropriate test to use is a trend test for 2xN tables 17;33;34. In this situation, the χ² test completely ignores the order of columns. Alternative tests suitable for similar situations are the Wilcoxon-Mann-Whitney test or the t-test with use of ordered scores 33;35. In addition to the fetal loss studies, the trend test has also been used in an analysis of increasing heterozygosity in the HLA-A and -B loci in the elderly compared to children 36. (See Analysis of Association, Confounding and Interaction for an explanation of the trend test.)

The Armitage trend test can also be used for case-control association data when the underlying genetic model is presumed to be additive {where risk increases by r-fold for heterozygotes and 2r-fold for homozygotes for the risk allele} (Lewis, 2002). This gradual increase is best assessed by the trend test, collapsing all other alleles into one category so that there will be three genotypes to compare in cases and controls. For SNP data, there are usually three genotypes (AA, AB, BB), and the trend test is the best test if the genetic model is additive. Another advantage of the trend test is its robustness against deviations from HWE (Sasieni, 1997; Xu, 2002).
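As a rough illustration of the trend test on genotype counts, the sketch below uses the usual Cochran-Armitage form with linear scores 0, 1, 2 for the three genotypes. The formula is not given in these notes, so it is an assumption here, as are the function name and the example counts; some texts apply a small-sample factor of (N-1)/N that is omitted in this version.

```python
import math

def armitage_trend(cases, controls, scores=(0, 1, 2)):
    """Cochran-Armitage trend chi-squared (1 d.f.) for a 2xK table of ordered categories."""
    totals = [r + s for r, s in zip(cases, controls)]
    n = sum(totals)
    p_bar = sum(cases) / n  # overall proportion of cases
    # Score-weighted sum of (observed cases - expected cases) per category
    t = sum(x * (r - m * p_bar) for x, r, m in zip(scores, cases, totals))
    sx = sum(m * x for x, m in zip(scores, totals))
    sxx = sum(m * x * x for x, m in zip(scores, totals))
    variance = p_bar * (1 - p_bar) * (sxx - sx * sx / n)
    chi2 = t * t / variance
    return chi2, math.erfc(math.sqrt(chi2 / 2))

# Hypothetical genotype counts (AA, AB, BB) in cases and controls
chi2, p = armitage_trend(cases=[20, 50, 30], controls=[40, 45, 15])
print(chi2, p)
```

With only two categories scored 0 and 1, this statistic reduces to the ordinary χ² for a 2x2 table, which is a convenient check on the implementation.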

Another misuse of the Chi-squared test is to use it when McNemar's test should be used. The Chi-squared test of independence of the two variables in a 2x2 table provides a test for the equality of two proportions when these proportions are estimated from two independent samples. If the two variables have derived from the same individuals, the Chi-squared test cannot be used as the variables are not from independent samples. The appropriate test for equality of proportions in paired samples is McNemar's test 17;37. This test finds its use in situations where the same sample is used to find out the agreement (concordance) of two diagnostic tests or difference (discordance) between two treatments. The difference is similar to that of two-sample t-test and paired t-test for interval data. Like the trend test, McNemar's test gives a smaller P value than the ordinary Chi-squared test for the same 2x2 table. Thus, it is again for the benefit of the researcher to use it when it is more appropriate to use. The calculation is not much different from what is done to obtain the ordinary χ² value, but only the discordant values in the 2x2 table are included (with Yates's correction) 37 (Online McNemar's Test (1) (2)).
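A minimal sketch of McNemar's test with the continuity correction is shown below. It assumes the commonly used form χ² = (|b-c|-1)²/(b+c), where b and c are the two discordant cells of the paired 2x2 table; this formula, the function name, and the example counts are not taken from the notes.

```python
import math

def mcnemar_chi2(b, c):
    """McNemar's chi-squared (1 d.f.) with continuity correction, from discordant pairs only."""
    if b + c == 0:
        raise ValueError("no discordant pairs: the test is undefined")
    chi2 = (abs(b - c) - 1) ** 2 / (b + c)
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# Hypothetical paired data: 15 pairs discordant one way, 5 the other way
print(mcnemar_chi2(15, 5))
```

The concordant pairs do not enter the calculation at all, which is why the result can differ substantially from the ordinary χ² computed on the same table.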

G-Test

The likelihood ratio (Chi-squared) test or maximum likelihood statistics are usually known as the G-test or G-statistics 38. Whenever a Chi-squared test can be employed, it can be replaced by the G-test. In fact, the Chi-squared test is an approximation of the log-likelihood ratio, which is the basis of the G-test. Pearson originally worked out this approximation because the computation of the log-likelihood was inconvenient (but it no longer is). Pearson's statistic, χ² = Σ [(O-E)²/E], is mathematically an approximation to the log-likelihood ratio, or G = 2 Σ O ln(O/E).

The value called G approximately follows the χ² distribution. The G value can also be expressed as

G = 2 [Σ O ln O - Σ O ln E] = 4.60517 [Σ O log10 O - Σ O log10 E]

The G-test as calculated above is as applicable as a test for goodness of fit using the same number of degrees of freedom as for the Chi-squared test. It should be preferred when |O-E| > E for any cell.

For the analysis of a contingency table for independence, Wilks 39 formulated the calculation of the G statistics as follows:

G = 2 [ Σ Σ fij ln fij - Σ Ri ln Ri - Σ Cj ln Cj + N ln N ]

where fij represents entries in each cell, Ri represents each row total, Cj represents each column total, and N is the sample size. The same formula can be written using logarithm base 10 as follows:

G = 4.60517 [ Σ Σ fij log10 fij - Σ Ri log10 Ri - Σ Cj log10 Cj + N log10 N ]

The G value approximates to χ² with d.f. = (r-1)(c-1). When necessary, Yates' correction should still be used and the formula needs to be modified accordingly. With the exception of the above mentioned condition that |O-E| should be smaller than E for the Chi-squared test to be valid, there is not much difference between the two tests and they should result in the same conclusion. When they give different results, the G-test may be more meaningful. The G-test has been gaining popularity in HLA and disease association studies 14;40 (Online G-test).
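The two ways of writing G can be checked against each other with a small sketch. The function names and the example table are illustrative assumptions; no continuity correction is applied, and cells with a zero count are taken to contribute nothing (the usual convention that x ln x is 0 at x = 0).

```python
import math

def _margins(table):
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    return rows, cols, sum(rows)

def g_from_expected(table):
    """G = 2 * sum of O ln(O/E), with E taken from the row and column totals."""
    rows, cols, n = _margins(table)
    g = 0.0
    for i, row in enumerate(table):
        for j, o in enumerate(row):
            if o > 0:
                g += o * math.log(o * n / (rows[i] * cols[j]))
    return 2 * g

def _xlnx(x):
    return x * math.log(x) if x > 0 else 0.0

def g_wilks(table):
    """Wilks' form using only cell counts, row totals, column totals and N."""
    rows, cols, n = _margins(table)
    cells = sum(_xlnx(f) for row in table for f in row)
    return 2 * (cells - sum(map(_xlnx, rows)) - sum(map(_xlnx, cols)) + _xlnx(n))

# Hypothetical 2x2 table: patients (30, 70), controls (15, 85)
table = [[30, 70], [15, 85]]
print(g_from_expected(table), g_wilks(table))
```

The constant 4.60517 in the base-10 versions is simply 2 ln 10, so converting between the two logarithm bases does not change the value of G.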

Fisher’s exact test

When the Chi-squared test is used for small samples, Yates's correction may need to be used to make adjustments for continuity. This correction, however, does not remove the requirement for the expected frequencies. As mentioned above, the expected frequencies in each cell (of a 2x2 table) should be at least five for the χ² test to be reliable (note that the observed frequencies may be less than five). The general belief is that when there is one or more expected frequency of less than five, the alternative approach for significance testing is Fisher's exact test 1;6;9;38. There is no equivalent method of estimation for comparing proportions from very small samples and Fisher's test is the best choice for such data 22. The result of Fisher's test is usually more or less the same as the result of the Chi-squared test with Yates's correction. Fisher's exact test does not depend on an approximation to the probability distribution of the cell counts. There are, however, statisticians defending the idea that this is a misconception in statistics and that Fisher's exact test is not suitable for small sample sizes 15. Yates himself replied to such criticism and defended the suitability of both Fisher's test and Yates's correction where appropriate 41. He also quotes Fisher as stating that the exact test should always be used whether or not the margins are determined in advance.

Fisher's exact test also uses the frequencies in a 2x2 table but the calculations are different from those of the other significance tests. It is based on the observed row and column totals. The method consists of constructing all possible 2x2 tables giving the same row and column totals as the observed data. For each table, the probability of such data arising if the null hypothesis is true is calculated. Then, the overall probability of getting the observed data is calculated. To get this probability, the probability of the observed data and the probabilities of all alternative 2x2 tables that are equally or more extreme than the observed data are added up 6. The calculations involve factorials, which are very cumbersome to do by hand for big numbers. Fisher's exact test can now be applied by means of computers (which can cope with the calculations of factorials for large numbers) even to large samples. This test is essentially one-sided. To get the two-sided P value, if required, the P value may be doubled 1 (to perform Fisher's Exact Test online, see the links at the end). There are exact tests available other than Fisher's; one of them, RxC, designed by Mark Miller, can perform the exact test for large contingency tables.
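The enumeration described above can be sketched in a few lines of Python. With the margins fixed, each possible table is determined by its top-left cell and its probability under the null hypothesis is hypergeometric. The function name and the example counts are assumptions for illustration; the two-sided value follows the doubling convention mentioned in the text, which is only one of several conventions in use.

```python
from math import comb

def fisher_exact_2x2(a, b, c, d):
    """Return (one-sided P, doubled two-sided P) for a 2x2 table with fixed margins."""
    row1, row2, col1 = a + b, c + d, a + c
    total_tables = comb(row1 + row2, col1)
    # All values the top-left cell can take without changing the margins
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    probs = {
        x: comb(row1, x) * comb(row2, col1 - x) / total_tables
        for x in range(lowest, highest + 1)
    }
    upper_tail = sum(p for x, p in probs.items() if x >= a)
    lower_tail = sum(p for x, p in probs.items() if x <= a)
    one_sided = min(upper_tail, lower_tail)
    return one_sided, min(1.0, 2 * one_sided)

# Hypothetical small table: 3 of 4 patients and 1 of 4 controls carry the allele
print(fisher_exact_2x2(3, 1, 1, 3))
```

Because Python integers have arbitrary precision, the binomial coefficients stay exact even for large tables, so the factorials that made the test cumbersome by hand are not a practical obstacle here.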

It is still common to mix the results of Chi-squared and Fisher's exact test in the same study. If some comparisons require the use of Fisher's exact test because of small expected numbers then it is best to use the Fisher's test for all comparisons. It does not make much sense to report some results by one test and others by another when all can be done most reliably by an exact test.

Two-sample Z-test

For the Z-test, only the two proportions (p1, p2) together with the sample sizes (n1, n2), and the estimated proportion (pe) in the population are required 6;18. This test is valid when both samples are large enough (≥ 30). The basic formula for this test is:

Z = (p1-p2) / SEdiff

The calculation of the SE of the difference (SEdiff) is slightly different for this test. First, the expected (combined) proportion in the population is calculated by pooling the two samples:

pe = (n1 p1 + n2 p2) / (n1+n2)

This is then used to calculate SEdiff:

SEdiff = √[pe(1-pe) (1/n1+1/n2)]
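A short sketch of the two-sample Z-test follows. It works from counts rather than proportions, so the combined proportion is the pooled count of positives divided by the pooled sample size, (n1 p1 + n2 p2)/(n1 + n2). The function name and the example counts are assumptions for illustration.

```python
import math

def two_sample_z(x1, n1, x2, n2):
    """Z-test for two independent proportions x1/n1 and x2/n2, with a two-sided P value."""
    p1, p2 = x1 / n1, x2 / n2
    pe = (x1 + x2) / (n1 + n2)  # pooled proportion under the null hypothesis
    se_diff = math.sqrt(pe * (1 - pe) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se_diff
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical data: allele present in 30 of 100 patients and 15 of 100 controls
z, p = two_sample_z(30, 100, 15, 100)
print(z, z * z, p)
```

For a 2x2 table, the square of this Z value equals the uncorrected χ² computed from the same counts, which is the sense in which the two tests are equivalent.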

One-sample Z-test

In HLA studies, it may be necessary to compare an observed value with an expected value. The expected value may be a homozygosity rate calculated from the gene or haplotype frequency in the same sample of the population assuming Hardy-Weinberg equilibrium.

The statistical test for the null hypothesis 'population proportion is equal to a specified proportion' is the Z-test for single proportion 6;18. The Z value for this comparison is calculated as follows:

Z = (po-pe) / √[po(1-po)/N]

where po is the observed and pe is the expected proportion. The corresponding P value for the Z statistics can be found from the standard normal distribution tables. For a two-sided test, if Z is greater than 1.96 (or smaller than - 1.96), P < 0.05.

Confidence interval estimation

When a value (a proportion or OR obtained from a sample) is the estimate of an unknown "true" value (within the population from where the sample has been drawn), a CI can be calculated for it. A CI is more informative than the simple results of hypothesis tests, where a null hypothesis is rejected or accepted. One of the traps of hypothesis testing is that non-significance may be equated with accepting the null hypothesis whereas it may just be a failure to reject it (due to a small sample size, for example). In the case of comparing two groups, a CI enables the researcher to see how large the difference between two proportions may be, not simply whether it is different from zero. It also shows whether a non-significant result suggests an outright rejection of the alternative hypothesis (lack of difference, i.e., the CI includes zero and does not reach the critical value for the difference) or an inconclusive result arising from lack of evidence (the CI includes zero but also exceeds the critical value) 42.

CIs provide a range of plausible values for the unknown 'constant' parameter (such as the mean or expected frequency in the population from which the sample has been drawn). CIs can be calculated for different confidence levels. If a CI is calculated at a 95% level (as usually done), 5% of the time the true population parameter will not be contained within the interval calculated from the sample statistics. More technically, it means that the intervals calculated from 95% of all samples drawn from the population will contain the population parameter. In a way, this is the acknowledgement of the fact that a different sample from the same population may produce a different result.

The width of the CI also gives us an idea about how uncertain we are about the unknown population parameter (the mean, for example). The most common CIs are calculated for a mean (or a single proportion), for the difference between two means and for a RR or OR. A very wide interval may indicate that the sample size should be increased to be more confident about the parameter. If the hypothesis test yields a significant result at the 5% level, the 95% CI for the difference between two means will not include 0, and the 95% CI for RR or OR will not include 1.0 (unity). For a risk-increasing factor, this means that the lower limit is above 0 or 1.0, respectively.

Calculation of confidence interval

In general, for any statistic that has a normal sampling distribution (such as the difference between means or proportions), a CI is constructed by adding to or subtracting from an estimate a multiple of its standard error 18. For example, a 95% CI is given by

statistic ± 1.96 SE

where SE refers to the standard error of the statistic, and 1.96 is the critical Z value (Zc) for P = 0.05 (or 2.58 for a 99% CI). The value of the Zc does not depend on the sample size.

CI for a single proportion can be calculated using this principle:

CI = proportion ± Zc SE

The SE for a single proportion (p) from a sample with size n is:

SE = √[p(1-p)/n]

Thus, 95% CI of a single proportion becomes 43:

p ± 1.96 √[p(1-p)/n]

As seen in this formula, sample size has an inverse relationship with the width of the CI. Larger samples yield a narrower CI. It can be shown mathematically that √[p(1-p)] can simply be replaced by 0.5. Whatever the value of p may be, p(1-p) will never be greater than 0.25, so its square root will never be greater than 0.5. If we use 0.5/√n, we will be using a number equal to or larger than the real SE in the CI formula. If anything, this could only result in a wider CI.
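Both versions of the interval for a single proportion can be written as a short sketch: the usual one based on √[p(1-p)/n], and the conservative one that replaces the square root of p(1-p) by its upper bound of 0.5. The function names and the example counts are illustrative assumptions.

```python
import math

def proportion_ci(x, n, zc=1.96):
    """Normal-approximation CI for a proportion x/n (zc = 1.96 gives a 95% CI)."""
    p = x / n
    se = math.sqrt(p * (1 - p) / n)
    return p - zc * se, p + zc * se

def conservative_ci(x, n, zc=1.96):
    """Conservative CI using 0.5/sqrt(n), the largest value the SE can take."""
    p = x / n
    half_width = zc * 0.5 / math.sqrt(n)
    return p - half_width, p + half_width

# Hypothetical data: allele present in 30 of 100 subjects
print(proportion_ci(30, 100))
print(conservative_ci(30, 100))
```

The conservative interval is never narrower than the usual one, and the two coincide only when p is exactly 0.5.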

A 95% CI of the difference between two proportions is calculated as 43:

Pdiff ± 1.96 SEdiff, where SEdiff = (1/N) √[(b+c) - (b-c)²/N]

(b and c are the discordant cells of a 2x2 table of paired observations; N is the grand total in the 2x2 table)

The formula for the 95% CI of the difference between two proportions from independent samples (p1 from n1 samples, and p2 from n2 samples) is 6;18:

(p1 - p2) ± 1.96 √[p1(1-p1)/n1 + p2(1-p2)/n2]
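The interval for two proportions estimated from separate samples (p1 from n1 subjects, p2 from n2 subjects) can be sketched as follows. Note that this SE uses each sample's own proportion, not the pooled proportion used in the Z-test. The function name and the example counts are assumptions for illustration.

```python
import math

def difference_ci(x1, n1, x2, n2, zc=1.96):
    """CI for p1 - p2 from two independent samples (zc = 1.96 gives a 95% CI)."""
    p1, p2 = x1 / n1, x2 / n2
    se_diff = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    diff = p1 - p2
    return diff, diff - zc * se_diff, diff + zc * se_diff

# Hypothetical data: allele present in 30 of 100 patients and 15 of 100 controls
print(difference_ci(30, 100, 15, 100))
```

An interval that excludes zero corresponds to a difference that is significant at the matching level, and its width shows how precisely the difference has been estimated.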

When a difference in the frequency of a gene or genotype is reported, the 95% CI for the difference should accompany this finding together with a P value rather than CIs of each mean separately 4;43. It is now possible to analyze a 2x2 table and obtain OR together with its CIs online (see the end of the document for the link).

Relative risk / Odds ratio calculation

The calculation of RR conferred by an HLA antigen / haplotype / genotype is usually done by Woolf's method 19 which was later modified by Haldane 44. Woolf defined the relative incidence (RI) as follows:

RI = ad / bc (a, b, c, d are the entries in the 2x2 table presented above)

Conventionally, RR is used in HLA and disease studies instead of relative incidence 5. To confuse the terminology further, the cross-product ratio described above as RI actually gives what is called the odds ratio (OR) in epidemiology. This distinction is not always made properly in publications on HLA associations. In the study of HLA specificities, some of the cell frequencies in the 2x2 table may be very small or even zero. If all the patients are positive for the HLA specificity or if none of the controls has it, the denominator would be zero. In this case, the RR would be undefined. For such situations Haldane modified Woolf's formula for RR:

RR = [(2a+1)(2d+1)] / [(2b+1)(2c+1)]

More precisely, the RR in a prospective epidemiological study is defined differently (this is the real RR as it is used in epidemiologic studies) 6;13;29. In the 2x2 table, the proportion of persons with the risk factor (allele i in this case) having the disease is a/(a+c), and the corresponding proportion for those lacking the risk factor is b/(b+d). Because relative risk is obtained by dividing the risk of development of disease among subjects with the risk factor by the risk of development of disease among subjects without the risk factor:

RR = [a/(a+c)] / [b/(b+d)], which equals RR = a(b+d) / [b(a+c)]

Especially for small values of a and b compared to c and d, ad/bc (which is the OR) is a close approximation 1;6;13. If the probability of an event is p, then the odds of that event is p/(1-p). The OR is the ratio of the odds of the risk factor in a diseased group and in a non-diseased (control) group (RR is the ratio of proportions in two groups). OR is more appropriate for retrospective case-control studies 1;6. In retrospective studies, the selection of the subjects is based on the outcome (the rows) such as patients and healthy controls. In prospective studies, however, it is based on the characteristic defining the groups (the columns, presence or absence of an HLA type). Thus, in a prospective study, it is known how many individuals in the risk group developed the disease. In a retrospective study, RR would not be a precise estimate as the risk of the outcome cannot be evaluated (because of the way the subjects were sampled). In the above RR formula, if a and b are small, as they usually are (hence the choice of retrospective design because of the rarity of the disease), RR is approximated by the OR (ad/bc). In other words, both values of (1-p) for exposed and non-exposed groups will be close to 1 and can be ignored. This is why in both retrospective and prospective HLA and disease association studies, ad/bc is calculated and quoted as either OR or RR. In cohort studies, however, and especially when the outcome is frequent enough (>10%), the OR does not approximate to RR 45. Under the null hypothesis, the expected value of RR or OR is 1.0 (unity).
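The three quantities discussed in this section can be compared with a small sketch: the cross-product ratio ad/bc, Haldane's modification for tables with a zero cell, and the prospective RR. The table layout follows the text (a and b are patients with and without the factor, c and d are controls with and without it); the function names and the counts are hypothetical.

```python
def odds_ratio(a, b, c, d):
    """Woolf's cross-product ratio ad/bc (undefined when b or c is zero)."""
    return (a * d) / (b * c)

def haldane_ratio(a, b, c, d):
    """Haldane's modification, equivalent to adding 0.5 to every cell."""
    return ((2 * a + 1) * (2 * d + 1)) / ((2 * b + 1) * (2 * c + 1))

def relative_risk(a, b, c, d):
    """Prospective RR: risk among those with the factor over risk among those without."""
    return (a / (a + c)) / (b / (b + d))

# Rare outcome in a hypothetical cohort: OR and RR nearly coincide
print(odds_ratio(3, 7, 1500, 8500), relative_risk(3, 7, 1500, 8500))

# A zero cell makes ad/bc undefined, while Haldane's version still gives a value
print(haldane_ratio(10, 0, 5, 20))
```

When the outcome is common, the two functions drift apart, which is the situation in which quoting ad/bc as RR becomes misleading.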

Several tests have been described to compute homogeneity of odds ratios for stratified data (combining estimates of the odds ratio) 17;20;21;46-48. The most popular of these is the Mantel-Haenszel test for the common odds ratio for stratified data 17;20;21. A Chi-squared test for homogeneity of ORs over different strata can be performed using Woolf's method 17.

Confidence interval for relative risk or odds ratio

As mentioned above, for variables with a normal sampling distribution, we need the SE of the statistics to calculate the CI. RR or OR itself does not have a normal sampling distribution, therefore, to achieve approximate normality, it has to be transformed. Since the natural logarithm of the RR or OR (lnRR or lnOR) has normal sampling distribution, the same principle can be applied to it. It can be shown mathematically that the standard error of the transformed RR is equal to 18:

SE (lnRR) = (lnRR) / √χ²

Substituting this into the general equation gives the 95% CI for lnRR as:

lnRR ± 1.96 (lnRR) / χ

where χ is the square root of the χ² value. After taking antilogarithms, this simplifies to:

RR^(1 ± 1.96/χ) and similarly OR^(1 ± 1.96/χ)

A general formula for the calculation of CI of RR / OR at any confidence level is:

RR^(1 ± Zc/χ) or OR^(1 ± Zc/χ)

Confidence limits obtained with this approach are reasonably good approximations to the exact limits, especially when RR / OR is not too far from unity. An alternative approach to calculating the CI for OR is as follows 17:

exp{ln(OR) ± Zc √[1/a + 1/b + 1/c + 1/d]}
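The two approaches to the CI of an OR can be put side by side in a short sketch: the test-based limits, in which the OR is raised to the power 1 ± Zc/χ, and the limits based on the standard error of ln(OR). The function name and the example counts are assumptions; the sketch expects all four cells to be non-zero, and the test-based limits are undefined when χ² is exactly zero.

```python
import math

def or_confidence_limits(a, b, c, d, zc=1.96):
    """Return (OR, test-based limits, log-based limits); zc = 1.96 gives 95% limits."""
    n = a + b + c + d
    odds_ratio = (a * d) / (b * c)
    # Test-based limits: OR ** (1 -/+ zc/chi), chi being the square root of chi-squared
    chi = math.sqrt(n * (a * d - b * c) ** 2 / ((a + b) * (a + c) * (b + d) * (c + d)))
    test_based = tuple(sorted(odds_ratio ** (1 + sign * zc / chi) for sign in (-1, 1)))
    # Log-based limits: exp(ln OR -/+ zc * sqrt(1/a + 1/b + 1/c + 1/d))
    se_ln_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    ln_or = math.log(odds_ratio)
    log_based = (math.exp(ln_or - zc * se_ln_or), math.exp(ln_or + zc * se_ln_or))
    return odds_ratio, test_based, log_based

# Hypothetical counts: patients (30, 70), controls (15, 85)
print(or_confidence_limits(30, 70, 15, 85))
```

For an OR reasonably close to unity, the two sets of limits should be similar, in line with the remark above that the test-based limits are good approximations in that range.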

Estimation of haplotype frequencies

A haplotype contains a minimum of two loci on the same chromosome. If these two loci are not in linkage disequilibrium (LD), it can be said that the frequencies of their alleles are independent of each other. In this case, the expected frequency of a two-locus haplotype can be calculated as the probability of the occurrence of two independent (or joint) events simply by multiplying their gene frequencies. Th