POSSIBLE MISUNDERSTANDINGS AND
MISCONCEPTIONS IN GENETICS
M.Tevfik Dorak, MD, PhD
* The presumed gene count of 100,000 for the human genome went down drastically to around 20,000 after the completion of the Human Genome Project. However, ~20,000 is the number of protein-coding genes. As of July 2013, the total number of genes (including pseudogenes and non-coding RNA genes) is around 58,000, with the total number of transcripts reaching almost 200,000 (Gencode).
* SNP is used to mean any sequence variation but this is not true. SNPs may be the most common type of variation, but there are many other kinds: most importantly structural variation (see Feuk, 2006) and copy number variations (see Redon, 2006). There were known sequence variations before the term SNP was first used but they were called different things. No one should think that there was no known variation in the genome before the term SNP was first used.
* A SNP association with a trait is usually attributed to changes in the activity of the gene where the SNP is located. Given the long-range effects of genetic variation, this assumption may not be right unless documented experimentally. Most SNPs show eQTL effects on genes far away from them (for example, rs2395185 shows a trans-eQTL effect on AOAH, which is on a different chromosome (Fehrmann, 2011 & Fairfax, 2012)).
* When a genetic association is found, it is very tempting to speculate that the associated variant may be a good biomarker (usually because of the small P value accompanying the association). This is an almost implausible scenario and even a good odds ratio in the range of 5 to 10 does not mean a single variant can be a good biomarker (see Wald, 1999).
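The point about odds ratios can be illustrated with a little arithmetic. The following is a minimal sketch, assuming a hypothetical binary marker carried by 10% of unaffected people and an odds ratio of 5; the function name and the numbers are illustrative assumptions, not values taken from Wald (1999).

```python
def case_carrier_frequency(control_freq: float, odds_ratio: float) -> float:
    """Carrier frequency among cases implied by the control frequency and the odds ratio."""
    odds_controls = control_freq / (1.0 - control_freq)
    odds_cases = odds_ratio * odds_controls
    return odds_cases / (1.0 + odds_cases)


# Hypothetical marker: present in 10% of unaffected people (the false-positive rate
# if the marker were used as a test), with an odds ratio of 5 for the disease.
false_positive_rate = 0.10
detection_rate = case_carrier_frequency(false_positive_rate, odds_ratio=5.0)

print(f"Detection rate: {detection_rate:.1%} "
      f"at a false-positive rate of {false_positive_rate:.0%}")
```

Under these assumptions the marker would be present in only about a third of cases while still flagging one in ten unaffected people, which is a poor performance for a screening test despite the seemingly large odds ratio.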
* Mutation involves any change in the hereditary material: from a point mutation to a chromosomal loss. To have a functional consequence, a mutation does not have to be in the coding region. An intronic mutation may well result in a non-functional gene (like the splicing site mutation in CYP21A2).
* Many authors use the term mutation for any rare allele (<1%) and the term polymorphism for any common allele (>1%). This is one definition of mutation and polymorphism. The other one is that ‘mutation’ is any variation in the gene that causes an obvious change in phenotype whereas polymorphisms do not cause any obvious phenotypic variation. It is best to be aware of these definitions while sticking with the recommendations of the Human Genome Variation Society and to use 'sequence variant', 'alteration' or 'allelic variant' for any genomic change regardless of their frequency or phenotypic effects.
* Even 5M SNP chips may not be able to cover the whole genome. One reason is that genes that need to be selectively amplified first (due to the presence of a duplicated copy or pseudogene) will not be represented at all. Highly polymorphic regions (such as HLA genes) are not represented either, because the lack of constant regions flanking the variants makes primer design difficult.
* Chromosomes do not have the shape of what they are most frequently illustrated as. Those figures refer to a metacentric 'replicated' chromosome.
* Base-pair (bp) is used to quantitate the length of nucleic acids but it should really be used for DNA only since RNA is single-stranded.
* Expressivity and penetrance are very different concepts. Expressivity is the variation in the expression of a trait or a disease (phenotypic heterogeneity). Gaucher disease and neurofibromatosis are examples of variable expressivity. Penetrance refers to the frequency of expression of a genotype regardless of the severity of the phenotype. Low penetrance genotypes will only be expressed in a small proportion of individuals bearing them (as in acute intermittent porphyria). This expression may still vary in its clinical severity (this is expressivity) (See Zlotogora J. Genetics in Medicine 2003).
* If different alleles of a locus cause the same disease, this is called 'allelic heterogeneity' but if they cause different diseases/phenotypes, this is also called 'allelic heterogeneity'. This term should be used unambiguously.
* Genetic heterogeneity and locus heterogeneity are used interchangeably in practice but this requires attention. 'Locus heterogeneity' is only used for the involvement of different loci in the causation of a disease/phenotype individually (as in early-onset Alzheimer disease; three different genes may cause the same phenotype). 'Genetic heterogeneity' may also be used to mean a combined effect of different loci in the development of a (complex) disease (as in diabetes; multiple loci are 'simultaneously' involved in the development of diabetes).
* There is a difference between a trait being influenced by genes and trait variation being influenced by genetic variation. See Terwilliger & Weiss, 2003.
* Syntenic means two different things in different contexts: It is (syntenic genes) described as "genes thought to reside on the same chromosome" in Dictionary of Genetics by King & Stansfield (see for example a Lecture Note on Linkage & Recombination); and (syntenic maps) as "genetic loci that lie in the same order on the same chromosome in different species" in Dictionary of Biological Terms by E Lawrence (see for example, NCBI Human-Mouse conserved synteny maps or Human chromosome 22 and syntenic Mouse chromosome 16 maps). Both are correct but this dual usage may cause misunderstandings. Lee Silver explains both in the same entry in the Encyclopedia of Genetics: synteny describes two or more genes or loci that have been mapped to the same linkage group. Conserved synteny refers to the situation where two linked loci in one species have homologs that are also linked in another species.
* Genetic distance may also mean two different things: (1) in population genetics, genetic distance is a measure of genetic relatedness of any number of species or populations, (2) genetic distance is also a measure of recombination (the unit is the centimorgan or cM) which roughly correlates with physical distance (the number of nucleotides between two loci) in the human genome.
* It is important to distinguish segments identical by descent (IBD) from those identical by state (IBS). IBS alleles look the same, and may have the same DNA sequence, but they are not derived from a known common ancestor. Alleles IBD are demonstrably copies of the same ancestral (parental or grandparental) allele. If two sibs each have the same allele, the shared allele is IBS, but it may or may not be IBD.
* At the genomic level, it is often quoted that there is 99.8% similarity in certain coding regions between humans and chimpanzee genomes. Remember that:
1. One nucleotide (out of 3 billion) difference may cause lethal diseases [sickle cell disease, hereditary hemochromatosis, etc]
2. Just one gene 'SRY' of the Y chromosome is responsible for almost all the difference between a male and a female
3. On the other hand, at the sequence level any two human subjects differ from each other by 0.1%. This corresponds to 3 million nucleotide differences. It is not the number of differences but the nature and location of differences that matter. And most importantly:
4. Any conclusion drawn only from linear DNA sequence comparison ignores the effects of epigenetic differences, posttranscriptional and posttranslational effects.
(See a discussion in Scientific American; Marks J: What It means To Be 98% Chimpanzee. University of California Press, 2002, Diamond J: The Third Chimpanzee. Perennial, 1994; What Makes us Humans? in Human Molecular Genetics; and Science Magazine Breakthrough of the Year 2005: Evolution in Action.)
* Humans are not descendants of apes. They share a common ancestor that lived about 5-7 Mya and is now extinct. This is similar to saying that modern humans are not descendants of Neanderthals but shared a common ancestor with them that lived about half a million years ago.
* Different species of humans have been recognized (H. erectus, H. habilis etc). This does not mean these ‘species’ lived contemporarily and could not interbreed. This operational classification is based on structural differences and not on genetic isolation.
* Mendel's experiments were on characters determined by single genes. Multiple genes and environmental factors that interact with one another determine most characters. Quantitative genetics deals with such characters or complex diseases. See also ‘Some apparent exceptions to Mendelian rules’.
* A quantitative trait locus (QTL) is involved in the expression of a continuous character like weight or height but not a countable one.
* Multivariable analysis and multiple comparisons have nothing in common. Performing multivariable adjustment (adjusting the odds ratio/P value for potential confounders) does not adjust for multiple comparisons. Also, the result of a multivariable analysis can go either way: it is not fair to state that “even after adjustment in a multivariable model, the association remained significant.” By the way, multivariable and multivariate are not the same thing; they have totally different meanings (see Biostatistics Glossary).
* Dominance and recessiveness are the features of phenotypes but not genes. It is more common to call genes dominant or recessive but strictly speaking, this is wrong.
* Dominance models in genetic epidemiology refer to associations with heterozygosity and ‘dominance’ here is very different from ‘dominance’ used in classic genetics.
* Dominant and recessive refer to the nature of alleles that determine dominant or recessive phenotypes. They have nothing to do with the frequency of an allele. In other words, a common allele is not necessarily a dominant allele.
* DNA and RNA are traditionally called nucleic acids. The so-called 'nucleic' acids include extra-nuclear (cytoplasmic) DNA and tRNA/rRNA too. In other words, mitochondrial DNA is also a nucleic acid but it is not in the nucleus. Similarly, viral DNA or RNA are nucleic acids but not enclosed in a nucleus.
* DNA is deoxyribonucleic acid composed of four nitrogenous bases (A, T, G, C), deoxyribose and acidic phosphate groups. It is the phosphate group that makes it acidic.
* Start codons AUG/GUG code for methionine and valine. This does not necessarily mean that each polypeptide starts with Met/Val because most of the time the initial residue is removed by post-translational modifications.
* The "stop" codon in the genetic code is a signal for the end of peptide synthesis in an open reading frame, not the end of transcription.
* There is redundancy in the genetic code in that for some amino acids, there is more than one codon available. Although there is redundancy in the code, there is no ambiguity as to which codon codes for which amino acid. The specificity is achieved at the tRNA level. Thus, each codon in the DNA sequence specifies one and only one amino acid.
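Redundancy without ambiguity can be pictured as a many-to-one mapping. The sketch below uses a small excerpt of the standard genetic code (not the full 64-codon table); the variable and function names are illustrative assumptions.

```python
from collections import defaultdict

# Excerpt of the standard genetic code (mRNA codons); deliberately incomplete.
CODON_TO_AMINO_ACID = {
    "AUG": "Met",
    "UGG": "Trp",
    "UUA": "Leu", "UUG": "Leu", "CUU": "Leu",
    "CUC": "Leu", "CUA": "Leu", "CUG": "Leu",
    "GGU": "Gly", "GGC": "Gly", "GGA": "Gly", "GGG": "Gly",
}


def codons_per_amino_acid(table):
    """Invert the codon table: amino acid -> list of synonymous codons."""
    grouped = defaultdict(list)
    for codon, amino_acid in table.items():
        grouped[amino_acid].append(codon)
    return dict(grouped)


SYNONYMS = codons_per_amino_acid(CODON_TO_AMINO_ACID)

# Redundancy: one amino acid may have several codons.
for amino_acid, codons in SYNONYMS.items():
    print(amino_acid, len(codons), codons)

# No ambiguity: a dictionary lookup returns exactly one amino acid per codon.
print("CUG ->", CODON_TO_AMINO_ACID["CUG"])
```

The lookup from codon to amino acid is a function (one answer per codon), while the reverse lookup returns a list, which is all that redundancy means here.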
* Genes are said to be transcribed 5' to 3'. This is the direction on the coding (sense) strand but it is actually the non-coding or template strand which is transcribed and this happens 3' to 5'. The resulting mRNA is made 5' to 3'.
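The strand directions can be traced on a toy sequence. This is a minimal sketch with a made-up nine-base sequence; the names are illustrative assumptions. The template strand is written 3' to 5' so that reading it left to right mimics the polymerase, and the mRNA comes out 5' to 3'.

```python
# Base-pairing rule used when RNA is synthesized on a DNA template.
DNA_TEMPLATE_TO_RNA = {"A": "U", "T": "A", "G": "C", "C": "G"}


def transcribe(template_3_to_5: str) -> str:
    """Return the mRNA (written 5'->3') made from a template strand written 3'->5'."""
    return "".join(DNA_TEMPLATE_TO_RNA[base] for base in template_3_to_5)


coding_5_to_3 = "ATGGCCTGA"    # hypothetical coding (sense) strand, 5'->3'
template_3_to_5 = "TACCGGACT"  # its complementary template strand, written 3'->5'

mrna_5_to_3 = transcribe(template_3_to_5)
print(mrna_5_to_3)

# The mRNA has the same sequence as the coding strand, with U in place of T.
print(mrna_5_to_3 == coding_5_to_3.replace("T", "U"))
```

This is why the coding strand, although never read by the polymerase, is the one conventionally quoted as "the gene sequence".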
* Antisense treatment is targeted against the mRNA (which is always a sense strand) but not against the sense (coding) strand of DNA.
* Beadle’s one gene-one enzyme/protein concept is essentially correct but not strictly valid any more. Alternative splicing, overlapping genes, posttranslational modifications and other mechanisms create more than one protein product from a given sequence in the genome. This is similar to the fact that the same gene may cause more than one distinct clinical syndrome (due to different mutations) (for more on this, see Clinical Genetics). One reason the estimated number of genes in the human genome was initially so high (around 100 thousand) was a strict interpretation of the one gene-one protein concept.
* The number of about 20,000 for functional genes in the human genome (earlier estimates were 30-35 thousand) is only for genes identified structurally by conventional criteria. It does not include the non-conventional genes such as small RNAs, transcribed/processed pseudogenes, alternatively spliced versions and some of the overlapping genes. The total number of proteins encoded by the human genome is many times the number of structurally recognized genes.
* Confusion may arise when two different numbers are quoted for the number of genes in a genome. One has to be specific about what is meant. The total number of genes (loci) and the number of protein-expressing genes are different in a genome. Likewise, the number of polymorphic markers (millions) is far greater than the number of genes (about 20,000 protein-coding genes in the human genome).
* Pseudogenes can be transcribed (examples are CYP21A1P and DRB4-null). Although not always translated into a protein product, a pseudogene can be transcribed to an RNA product and this can be involved in gene expression regulation (for an example, see Hirotsune et al, 2003). This is similar to transcribed but untranslated parts of a conventional gene. Pseudogenes and processed pseudogenes should also be distinguished (for more, see http://pseudogene.org).
* DNA is not the blueprint for life. It can be said that the DNA contains the biochemical instructions a living organism will need. Think about sex determination in reptiles or the penetrance issue for a cancer susceptibility gene. The epigenetic variation creates difference between individuals bearing the same sequence of DNA (see the chimpanzee vs. human sequence similarity discussion above).
* Leader sequence and signal sequence are used interchangeably as if they were the same thing. Although this is common practice, strictly speaking, leader sequence is only transcribed but not translated and it leads the mRNA to the ribosomes; signal sequence is translated and helps the protein to reach its final destination (this may be outside the cell) where it is cleaved off.
* Another unfortunate common practice is using allele/gene/antigen/phenotype/marker frequency interchangeably. For a brief discussion of their exact definition, see: Statistical Analysis of HLA and Disease Associations. One thing that is more than just careless usage is the use of ‘carrier / carriage frequency’ instead of marker frequency. Carrier frequency has a specific meaning in genetics (as in carrier frequency for thalassaemia trait) and should not be used to describe the proportion of tested subjects positive for a marker (heterozygous and homozygous combined) when marker frequency is the more appropriate term (for reference, see Svejgaard & Ryder, 1994).
* A huge and unforgivable mistake is to compare allele (gene) frequencies (corresponding to multiplicative model) with marker (allele positivity) frequencies (corresponding to dominant model) in association studies (and usually to find a very significant association!). This is not as rare as one might think (or hope).
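The two frequencies are computed from the same genotype counts but with different numerators and denominators, which is why they must never be compared with each other. The following is a minimal sketch using made-up genotype counts; the function names are illustrative assumptions.

```python
def allele_frequency(n_hom_variant: int, n_het: int, n_total: int) -> float:
    """Proportion of chromosomes carrying the variant allele (2 chromosomes per subject)."""
    return (2 * n_hom_variant + n_het) / (2 * n_total)


def marker_frequency(n_hom_variant: int, n_het: int, n_total: int) -> float:
    """Proportion of subjects positive for the variant (heterozygous or homozygous)."""
    return (n_hom_variant + n_het) / n_total


# Hypothetical sample of 100 subjects: 4 variant homozygotes, 32 heterozygotes.
hom, het, total = 4, 32, 100
print("allele frequency:", allele_frequency(hom, het, total))
print("marker frequency:", marker_frequency(hom, het, total))
```

In this made-up sample the same data give 0.20 as the allele frequency and 0.36 as the marker frequency, so comparing the allele frequency in one group with the marker frequency in another would create a spurious "association" even if the groups were identical.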
* Polymorphism has several definitions. A polymorphic locus was originally defined as a locus in which the least common allele occurs with a frequency of at least 1%. A more appropriate definition has been suggested by Elston as a locus in which the most common allele occurs with a frequency of at most 99% (Elston, 2000). The original definition would fail to accommodate the HLA loci, which have >100 alleles (although these are the cumulative product of multiple polymorphic sites within each gene), but Elston's definition allows for more than 100 alleles.
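The difference between the two definitions shows up as soon as a locus has more than 100 alleles, because not all of them can reach 1%. This is a minimal sketch with made-up allele frequencies; the function names are illustrative assumptions.

```python
def is_polymorphic_original(allele_freqs) -> bool:
    """Original definition: the least common allele has a frequency of at least 1%."""
    return min(allele_freqs) >= 0.01


def is_polymorphic_elston(allele_freqs) -> bool:
    """Elston's definition: the most common allele has a frequency of at most 99%."""
    return max(allele_freqs) <= 0.99


# Hypothetical HLA-like locus: 200 alleles, each with a frequency of 0.5%.
many_alleles = [1 / 200] * 200
# Hypothetical biallelic locus with a common variant.
biallelic = [0.7, 0.3]

for name, freqs in [("200 alleles", many_alleles), ("biallelic", biallelic)]:
    print(name, is_polymorphic_original(freqs), is_polymorphic_elston(freqs))
```

The 200-allele locus is obviously highly variable, yet it fails the original criterion and passes Elston's.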
* A functional polymorphism may either increase or decrease gene expression. This means not all polymorphisms decrease gene expression/function. This is similar to saying that not all polymorphisms are deleterious or associated with risk; they can be markers for risk or protection.
* A genetic association may be a chance finding (due to sampling error), and the opposite is also true: the lack of an association may be a chance finding (i.e., an association may be obscured by sampling error). However, no reviewer will tell you that failure to replicate an association might have been due to chance! Population stratification or any bias may also work either way, but these are usually suspected, or have to be ruled out, only when an association is found.
* In PCR, dNTPs are used but in the final product, the nucleotides are dNMPs. The two phosphates are cleaved off to obtain energy to drive the reaction (DNA replication).
* Meiosis is said to create four daughter cells out of one. This does not apply to an oocyte, which gives rise to a single daughter cell (ovum) and two polar bodies. In other words, an ovum's chromosome content is halved only after fertilization.
* It is quite common to name the growing human offspring as embryo or fetus regardless of the period of intrauterine life. The correct terminology for the offspring: [fertilization] zygote - conceptus - embryo (after implantation) - fetus (after organogenesis is complete) [birth]. (Plural for conceptus is either concepti (Latin) or conceptuses (English)).
* 'Murine' refers to the rodent family Muridae, which includes both rats and mice. By common practice, however, the term is used almost exclusively for mice. If you mean mice, say mice.
* Evolution does not prearrange what action to take. It proceeds by natural selection in which individuals with the most adaptive characteristics in a given environment are selected (favored). Over many generations, the number of individuals bearing the adaptive characteristics increases until all individuals have them.
* Genetic fitness is the overall ability to leave surviving offspring who themselves will be able to reproduce successfully. It has no correlation with physical fitness. Also, reduced fitness does not necessarily involve death; there are many long-lived but infertile people.
* Heritability may be high in a given population for a given character/disease but this does not mean the environment does not play any role in the expression of that phenotype. If the same estimate is attempted in a different population, it may not be that high.
* Ethical issues aside, eugenics is a fallacy. Even if all individuals expressing a recessive disease are eliminated from the population (most of whom are not fertile anyway), there will still be about 100 times more asymptomatic carriers of the gene for the same disease. Unless based on a very comprehensive genetic screening program, a eugenics program will never achieve its aim.
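The "about 100 times more carriers" figure follows from Hardy-Weinberg proportions. The sketch below assumes random mating and a hypothetical recessive disease allele with a frequency of 0.02 (disease frequency 1 in 2,500); the function name and the chosen frequency are illustrative assumptions.

```python
def carriers_per_affected(disease_allele_freq: float) -> float:
    """Ratio of heterozygous carriers (2pq) to affected homozygotes (q^2) under Hardy-Weinberg."""
    q = disease_allele_freq
    p = 1.0 - q
    carriers = 2.0 * p * q
    affected = q * q
    return carriers / affected


q = 0.02  # hypothetical disease allele frequency
print(f"affected: {q * q:.4%} of the population")
print(f"carriers: {2 * (1 - q) * q:.2%} of the population")
print(f"carriers per affected individual: {carriers_per_affected(q):.0f}")
```

The ratio simplifies to 2(1 - q)/q, so the rarer the disease allele, the larger the hidden reservoir of carriers relative to affected individuals.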
* It is now possible to type thousands of polymorphisms of the genome in a single assay using microarray technology. Every time a new susceptibility marker is announced, there is a lot of talk about the possibility of an insurance company’s use of it. If we are all screened for all of our genes, all of us will have a few recessive lethal genes and a lot of susceptibility markers for complex diseases. No one will ever be clear of all susceptibility genes. The fallacy is that having the nucleotide sequence for a ‘bad’ gene does not mean it will do any harm.
* Susceptibility and predisposition are usually used interchangeably but most genetic epidemiologists have begun to mean different things by these two words. Predisposing genes are those high-penetrance genes that are necessary and sufficient to cause a disease and susceptibility genes are low-penetrance genes that are neither necessary nor sufficient to cause a disease. Susceptibility genes contribute to disease development in a multifactorial setting but the disease can occur in their absence (Greenberg, 1993; Greenberg & Doneshka, 1996).
* Nick translation is not actually a translation event in the classical sense. It is the replication of DNA by a polymerase.
* Gene expression is the process that converts a gene's coded information into the structures operating in the cell. Expressed genes include those that are transcribed and translated all the way to proteins, and those that are transcribed into RNA but not translated into protein (e.g., transfer and ribosomal RNAs). XIST and H19 genes are transcribed but not translated to a protein product (Joubel, 1996; Milligan, 2002). Thus, gene expression should not be described as conversion of genomic information to protein sequences.
* Nucleotide level changes are only one of many phenomena that affect gene expression. Cis- and trans-acting modifiers, epistatic interactions, gene-environment interactions, parent-of-origin, sex-specific imprinting and other epigenetic effects, post-translational modifications and many others are involved in gene expression. These are some of the reasons for the lack of strict genotype-phenotype correlations for many genes.
* Gene conversion is a misleading and confusing name for what it describes. It means conversion (transformation) of one gene into another because of exchange of genetic material with another gene.
* Homology, analogy and paralogy are related but different concepts and cannot be used interchangeably. Homology and analogy concern different species (former with a common ancestor) and paralogy concerns the same species (see glossary). Most importantly, most of the time there is no such thing as 'sequence homology' but what is meant by this is 'sequence similarity' (see Similarity, Homology, Divergence and Convergence in the NCBI Online Book: Sequence - Evolution - Function).
* The mitochondrial Eve theory does not say that at some point in history there was only one female member of our species. The difference between gene and individual genealogies is important in the interpretation of the findings that led to this theory. All genes eventually coalesce into a single common ancestor. The mitochondrial Eve was not necessarily the only female alive at that time, or the first female to have that particular type. Many other mtDNA lineages may have gone extinct before or after she lived.
* Linkage, association and linkage disequilibrium (LD) are all different concepts. The HLA-A locus and hereditary hemochromatosis show linkage in families. The HLA-A3 allele shows an association at the population level because of LD between the C282Y mutation of HFE and HLA-A3. Linkage does not imply LD and vice versa. The delta (D) value may be zero for linked loci (linkage only occurring in families and no LD at the population level which would occur if the disease mutation occurred on multiple chromosomes independently), and delta value may be different from zero for unlinked loci. Linkage stems from no recombination within the family (limited number of meiosis events) and LD is a reflection of proportion of recombination events over many generations in the population.
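The delta (D) value mentioned above is simply the difference between the observed haplotype frequency and the frequency expected if the two alleles were independent. This is a minimal sketch with made-up haplotype frequencies for two biallelic loci; the function name and the data are illustrative assumptions.

```python
def linkage_disequilibrium(hap_freqs: dict) -> float:
    """D = f(AB) - f(A) * f(B), from two-locus haplotype frequencies AB, Ab, aB, ab."""
    freq_a = hap_freqs["AB"] + hap_freqs["Ab"]  # frequency of allele A at locus 1
    freq_b = hap_freqs["AB"] + hap_freqs["aB"]  # frequency of allele B at locus 2
    return hap_freqs["AB"] - freq_a * freq_b


# Hypothetical population in which A and B tend to travel together.
in_ld = {"AB": 0.40, "Ab": 0.10, "aB": 0.10, "ab": 0.40}
# Hypothetical population in which the two loci are in linkage equilibrium.
in_equilibrium = {"AB": 0.25, "Ab": 0.25, "aB": 0.25, "ab": 0.25}

print("D (LD):", round(linkage_disequilibrium(in_ld), 4))
print("D (equilibrium):", linkage_disequilibrium(in_equilibrium))
```

Nothing in the calculation uses the physical position of the loci, which is why D can be zero for linked loci and non-zero for unlinked ones.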
* A “risk marker” does not have to increase the risk, it can be protective. An association is either a risk association or a protective one.
* Unless otherwise stated, a genetic polymorphism association is with the variant (minor, rare) allele of the polymorphism.
* The variant allele of a polymorphism is not necessarily deleterious; it may cause a beneficial change in gene function.
* A protective odds ratio (<1.0) may be said to decrease the risk by (1-OR)x100 percent but a risk OR cannot be translated to a percentage as frequently done. This is because protective odds ratios lie between 0 and 1 but risk odds ratios lie between 1 and infinity.
* Model-based linkage analysis is based on a likelihood ratio, the logarithm of which is called a LOD score. This is not the logarithm of the odds for linkage but the logarithm of the likelihood ratio for a particular value of the recombination fraction vs. free recombination, i.e., θ (theta) = 0.5 (Elston, 1998; Olson, 1999).
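As an illustration of the likelihood ratio behind a LOD score, the sketch below handles the simplest textbook case: phase-known meioses that can each be counted as recombinant or non-recombinant. That simplification, the function name and the counts are assumptions made for illustration; real pedigree analysis is considerably more involved.

```python
import math


def lod_score(theta: float, recombinants: int, meioses: int) -> float:
    """log10 of L(theta) / L(0.5) for phase-known, fully informative meioses.

    L(theta) = theta^R * (1 - theta)^(N - R), with R recombinants out of N meioses.
    """
    if not 0.0 < theta <= 0.5:
        raise ValueError("theta must be in the interval (0, 0.5]")
    non_recombinants = meioses - recombinants
    log_l_theta = (recombinants * math.log10(theta)
                   + non_recombinants * math.log10(1.0 - theta))
    log_l_free = meioses * math.log10(0.5)  # free recombination, theta = 0.5
    return log_l_theta - log_l_free


# Hypothetical family data: 1 recombinant among 10 informative meioses.
for theta in (0.05, 0.1, 0.2, 0.3, 0.5):
    print(theta, round(lod_score(theta, recombinants=1, meioses=10), 3))
```

The score is zero at theta = 0.5 by construction, since the numerator and denominator are then the same likelihood.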
* Linkage disequilibrium (LD) and Hardy-Weinberg equilibrium (HWE) are also different things and have not got much in common. LD is -not having- linkage equilibrium, which is quantitated by a delta value and an associated P value shows the significance of the disequilibrium. In HWE tests, getting a significant P value also means disequilibrium, which is a worrying thing when the population sample is supposed to be in HWE (like the control group of an association study).
* A major assumption of HWE is random mating, while there is no random mating in any human population. How many tall women do you know who are married to short men? The popular program Haploview uses P = 0.001 as the threshold for Hardy-Weinberg equilibrium violation, probably in recognition of the unrealistic assumptions of HWE.
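A basic HWE check compares observed genotype counts with those expected from the allele frequencies. The following is a minimal sketch using the one-degree-of-freedom chi-square test on made-up genotype counts; the function name and the data are illustrative assumptions, and this is not a description of how Haploview itself computes its P value. Only the 0.001 threshold comes from the text above.

```python
import math

HWE_P_THRESHOLD = 0.001  # threshold quoted in the text for Haploview


def hwe_chi_square(n_aa: int, n_ab: int, n_bb: int):
    """Chi-square statistic (1 df) and P value for deviation from Hardy-Weinberg proportions."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1.0 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Hypothetical control groups of 100 subjects each, both with allele frequency 0.8.
in_hwe = hwe_chi_square(64, 32, 4)      # matches the expected proportions
off_hwe = hwe_chi_square(70, 20, 10)    # deficit of heterozygotes

for label, (chi2, p_value) in [("in HWE", in_hwe), ("off HWE", off_hwe)]:
    print(label, round(chi2, 4), p_value, p_value < HWE_P_THRESHOLD)
```

For small samples or rare alleles an exact test is generally preferred over the chi-square approximation used in this sketch.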
* The definition of haplotype does not require having two markers on the same chromosome. For the HLA system, for example, HLA-B44-DR4 is a haplotype flanked by the genes encoding B44 and DR4 but it is also possible to talk about DR4 haplotypes. This would mean an undefined length of chromosomal segment carrying DR4 gene extending either way.
* Cancer is a genetic disease but the genetic changes concern somatic cells. Cancer due to germline DNA mutations (inherited cancer) is less than 10% of all cancers. Genes are usually the target, not the origin, of the cancer process. Most mutations found in cancer cells (somatic) cannot be detected in surrounding healthy cells (germ-line).
* Germline DNA mutations that cause inherited cancer are both necessary and sufficient for cancer development and have high penetrance. The more common low-penetrance germline polymorphisms that modify cancer risk are part of the complex genetic susceptibility and do not cause cancer on their own.
* Oncogenes should be called proto-oncogenes. Anti-oncogenes are not genes that antagonize the effects of the oncogenes. They are the genes with anti-oncogenesis effects. Similarly, carcinogens (or cancerogens) are not necessarily cancer-causing genes, they are more frequently chemicals or other environmental factors (e.g., viruses) which promote cancer development.
* Knudson's original two-hit hypothesis (Knudson, 1975), which suggested that loss of heterozygosity or homozygous deletion provides the two hits required for the loss of tumor suppressor gene activity, has now been extended to include transcriptional silencing by DNA methylation of promoters, which can also disable tumor suppressor genes (reviews by Yu & Shen, 2002; Balmain et al, 2003 and Paige, 2003).
* X-linked diseases (eg, hemophilia, Wiskott-Aldrich syndrome) occur almost exclusively in males but these are recessive ones. X-linked dominant diseases are not seen in males because affected males are lost during fetal development (see Clinical Genetics).
* Some X chromosome genes escape inactivation (Shapiro, 1979). This creates a situation that is the opposite of what is achieved by Lyonisation (gene dosage compensation): X chromosome genes that escape inactivation are represented twice as much in females (eg, MIC2X/CD99; STS; XG). Such genes are either in the pseudoautosomal region of the X and Y chromosomes or have a homologous gene on the Y chromosome (eg, ZFX/ZFY).
* Not all genes on X-chromosome behave like sex-chromosome-linked genes. Those within the pseudoautosomal regions (PAR) exist in two copies in both sexes (see pseudoautosomal region in Genes and Chromosomes).
* Does anyone have an idea about the following (or would like to do research on these)?
1. Why is it that primary sex ratio at fertilization may be as high as 165:100 (see for example: Tricomi, 1960; Shettles, 1964; Serr & Ismajovich, 1963; Lee & Takano, 1970; McMillen, 1979; Kellokumpu-Lehtinen & Pelliniemi, 1984; Vatten, 2004; C3 Newsletter 13/2) but it falls down to 106:100 at birth in humans (and similarly in most mammals) but nobody thinks about the reasons/implications of this? A continuation of this process (elimination of excess males) is the increased morbidity and mortality of male infants and children (well-known male disadvantage (Stevenson, 2000) or fragile male (Kraemer, 2000), which has evolutionary explanations (Trivers & Willard, 1973; Wells, 2000; Dorak, 2002)). Could the sex-chromosomes be involved? [One reason must be the male-specific lethality associated with X-linked dominant diseases but this is not frequent enough to explain the huge loss.]
2. Why is everybody acknowledging that the carrier frequency for HFE-C282Y (causing hereditary hemochromatosis) and CYP21A2 mutations (causing congenital adrenal hyperplasia) is so high (more than 1 in 50 people carry them in some populations) but not much is done about their implications in public health?
Last updated on 24 March 2014