



M.Tevfik Dorak, MD, PhD


PowerPoint Presentation on Gene Expression


The Central Dogma of molecular biology states that RNA is made using DNA templates and that protein is made using the information in RNA (the genome encodes the transcriptome, which is translated into the proteome; see a List of OMEs). Although this sounds like stating the obvious, it was an important principle when first proposed. An exception is reverse transcription, which allows DNA to be made from RNA (prions may be another exception). The flow from DNA to protein involves transcription and translation, with possible modifications after each step. Expressed genes include those that are transcribed and translated all the way to proteins, and those that are transcribed into RNA but not translated into protein. Transfer and ribosomal RNA genes, XIST and H19 are transcribed but not translated into a protein product (Joubel, 1996; Milligan, 2002). Thus, gene expression should not be described simply as the conversion of genomic information into protein sequences. Another classic concept in genetics is the principle of one gene-one polypeptide. This also had to be revised because one gene can produce a variety of products.


Summary of gene transcription and translation

Genes are transcribed by RNA polymerase II so that the RNA grows in the 5' to 3' direction and reads like the sense (coding) strand. It is actually the antisense (template) strand that is read by the enzyme (3' to 5'), which gives an RNA strand identical in sequence to the sense strand except that uracil replaces thymine. The initial product of transcription is RNA, more precisely heterogeneous nuclear RNA (hnRNA). The four bases in the RNA polynucleotide chain are A, G, C and U (uracil, which is demethylated thymine, takes the place of thymine). Base pairings in RNA are usually A-U and G-C, but weaker G-U pairings can be identified occasionally. The RNA copy of the DNA that is to become a polypeptide is called messenger RNA (mRNA). The starting point of transcription lies at the promoter region of the gene. The promoter is recognized by transcription factors (DNA-binding proteins) that control the rate and degree of transcription. The information for promoter function is provided directly by the DNA sequence, and its structure represents the actual signal. Many promoters have a TATA box located approximately 25-35 base pairs upstream of the transcription initiation site. The TATA box tends to be surrounded by GC-rich sequences. The fixed position of the TATA box is critical to the positioning of the RNA polymerase. Another conserved sequence, approximately 70 bases upstream of the transcription initiation site, is the CAAT box. Other upstream elements are known to modify the rate of transcription. From the promoter region, the RNA polymerase enzyme travels along the template, synthesising RNA, until shortly after a transcription termination signal (this lies downstream of the stop codon, which terminates translation rather than transcription). For a short stretch, as the RNA polymerase moves along the unwinding DNA molecule, the DNA is in a single-stranded conformation. As soon as the enzyme has passed by, the DNA duplex reforms. In this fashion, the integrity of the DNA molecule is maintained during transcription.
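The relationship between the template strand, the coding strand and the RNA product can be illustrated with a minimal Python sketch. The function names and the nine-base example sequence are hypothetical and chosen only for illustration; the sketch ignores promoters, termination and RNA processing.

```python
# Base-pairing rules used when RNA is synthesised on a DNA template.
DNA_TO_RNA = {"A": "U", "T": "A", "G": "C", "C": "G"}


def transcribe(template_3_to_5: str) -> str:
    """Return the RNA (written 5'->3') copied from a template strand written 3'->5'."""
    return "".join(DNA_TO_RNA[base] for base in template_3_to_5.upper())


def coding_to_rna(coding_5_to_3: str) -> str:
    """Return the RNA expected from the coding strand: same sequence, U for T."""
    return coding_5_to_3.upper().replace("T", "U")


coding = "ATGGCCTAA"    # sense strand, 5'->3' (made-up example)
template = "TACCGGATT"  # antisense strand, 3'->5', complementary to the above

print(transcribe(template))     # RNA read off the template strand
print(coding_to_rna(coding))    # same sequence derived from the coding strand
```

Both calls give the same RNA sequence, which is the point made above: the template strand is the one that is read, but the transcript matches the coding strand with U in place of T.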
The nucleic acid sequence of the gene includes not only the coding sequence (CDS, contained in the exons), which corresponds to the amino acid sequence of the protein, but also additional intervening (non-coding) sequences (introns). The exons are the regions of the gene represented in the processed mRNA and introns are regions that are spliced out of the processed mRNA. This processing of the RNA occurs in the nucleus. In addition to splicing out of the introns, a series of up to about 200 adenine nucleotides is added to the 3' end of the mRNA molecule (polyadenylation). The processed mRNA is then transported to the cytoplasm and translated into protein by the ribosomes. Each ribosome consists of two subunits, which contain several proteins in association with long RNA molecules known as ribosomal RNA. The generation of the polypeptide is mediated by yet another small RNA species, transfer RNA (tRNA). A tRNA carries only the one amino acid to which it is covalently linked. The tRNA contains a trinucleotide sequence, the anticodon, which is complementary to the codon that represents the amino acid. The anticodon allows the tRNA to recognise the codon through complementary base pairing. The ribosome moves along the mRNA, permitting sequential amino acids to be assembled into protein. The order of the amino acids in a protein confers its ultimate three-dimensional conformation and its biological and biochemical activities. Alteration of the amino acid sequence due to a missense nucleotide substitution may change these characteristics.
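Codon-by-codon reading of a processed mRNA can be sketched as follows. This is a simplified illustration, assuming the standard genetic code, translation from the first AUG found, and a made-up input sequence; the function name `translate` is hypothetical, and real initiation depends on sequence context that is not modelled here.

```python
from itertools import product

# Standard genetic code, listed in the conventional U, C, A, G order ("*" = stop).
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(mrna: str) -> str:
    """Translate from the first AUG to the first in-frame stop codon."""
    mrna = mrna.upper()
    start = mrna.find("AUG")
    if start == -1:
        return ""  # no start codon, nothing to translate
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break  # stop codon: release the polypeptide
        protein.append(amino_acid)
    return "".join(protein)


# Two untranslated bases, then AUG GCC AAA UAA, then two trailing bases.
print(translate("GGAUGGCCAAAUAAGG"))
```

A missense substitution can be tried by changing a single base of the input: for example, GCC (alanine) to GAC (aspartic acid) changes the second residue of the product.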


Regulation of gene expression involves the following steps:

Transcriptional control: how and when a gene is transcribed (including epigenetic control)

RNA processing control: how the RNA transcript is processed (including small nuclear ribonucleoproteins, snRNPs)

RNA transport control: which mRNA transcripts are exported to the cytoplasm, and where they reach in the cytoplasm

mRNA stability control: selective degradation of some mRNAs (including RNA interference by ‘small interfering’ RNAs, siRNAs)

Translational control: selecting which mRNAs in the cytoplasm are translated by ribosomes (including regulation by microRNAs)

Post-translational control: selectively activating, inactivating, degrading or compartmentalizing particular proteins.


It is possible for one gene to be encoded on the sense strand and another on the antisense strand, in the opposite direction (overlapping genes; for an example, see CYP21A2 and TNXB; Tee, 1995). Overlapping genes are particularly frequent in viruses, plasmids and phages, which need to pack a lot of information into a small, compact genome (HIV is an example). Degeneracy of the genetic code facilitates the presence of overlapping genes. See also Michael W. Pfaffl’s Page for an excellent review of gene structure and expression as well as molecular genetic methods to quantitate gene expression, including microarray methods supplemented by videos.


Components of genes playing roles in their expression

In molecular terms, a gene is the entire DNA sequence required for synthesis of a functional protein or RNA product. In addition to the coding and intervening sequences, a gene also includes transcription-control regions. Untranslated regions, exons and introns together make up a complete gene.

Enhancers: An enhancer is a cis-acting regulatory element on either 5’ or 3’ flanking region of a gene that stimulates a specific promoter. It is not transcribed. Enhancers are usually at some distance even from the distal promoter region. In some cases, distant enhancers are important for developmental expression (eg, globins), cell-specific expression (eg, immunoglobulins), or hormonal regulation (eg, glucocorticoid response elements). Enhancers are binding sites for transcriptional activators (activating transcription factors) such as AP-1, AP-2, Oct-1, GATA-1, P53, NF-kB. Certain genes have silencers that do the opposite of enhancers.


Promoters: The region in the close vicinity of the transcription initiation site at the 5' end of a gene. The promoter region is not transcribed. Promoters are the initial binding sites for RNA polymerases. Transcription factors bind to the promoters and allow RNA polymerase to act. The proximal region of the promoter is generally packed with short (5-15 nucleotides) motifs called transcription factor binding sites (promoter-proximal elements). The proximal promoter region contains sites for ubiquitous proteins such as Sp-1, CCAAT/enhancer-binding protein (C/EBP), cAMP response element-binding protein (CREB), or Activator Protein-1 (AP-1). Factors involved in cell-specific expression may also bind to these sites.


The minimal promoter region that is capable of initiating basal transcription is called the core promoter. The common promoter elements, the TATA and CAAT boxes, are found about 30 bp and 70 bp, respectively, upstream of the transcription initiation site. The TATA box is the most conserved functional signal in eukaryotic promoters. It is present in 30-50% of promoters. Many highly expressed genes contain a strong TATA box in their promoters (TATA+). TATA-less (TATA-) promoters usually belong to housekeeping genes, some oncogenes and growth factors. TATA- genes usually have a downstream promoter element (DPE) located approximately 30 bp downstream of the transcription initiation site. This group of promoters is much harder to identify with bioinformatic tools than the TATA+ group. The region 200-300 bp immediately upstream of the core promoter is the proximal promoter, which is full of transcription factor binding sites. Further upstream is the distal promoter region, which usually contains enhancers and some transcription factor binding sites. It is the distal promoter region that is most difficult to identify through computational methods because of its lack of standard features and high variability (reviews and reports on promoter prediction: Pedersen, 1999; Werner, 1999; Davuluri, 2001; Ohler & Niemann, 2001; Suzuki, 2001 'full text'; Hannenhalli & Levy, 2001 'full text'; Solovyev & Shahmuradov, 2003; see also Bioinformatics approaches and resources for SNP functional analysis by Mooney, 2005).
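As a rough illustration of why TATA+ promoters are the easier case for computational prediction, the sketch below scans a window upstream of a known transcription initiation site for a TATA-like motif. Several things here are assumptions rather than anything taken from the tools cited above: the motif pattern TATA[AT]A[AT], the default search window of 20-40 bp upstream (slightly wider than the typical position so that the whole motif fits), the function name, and the synthetic test sequence.

```python
import re

# Assumed TATA-like consensus; real predictors use position weight matrices.
TATA_PATTERN = re.compile(r"TATA[AT]A[AT]")


def find_tata_boxes(sequence: str, tss_index: int, window=(20, 40)) -> list:
    """Return motif start positions relative to the initiation site.

    tss_index is the 0-based index of the first transcribed base (+1);
    the base immediately upstream of it is reported as -1.
    """
    near, far = window
    lo = max(0, tss_index - far)
    hi = max(0, tss_index - near)
    region = sequence[lo:hi].upper()
    return [lo + m.start() - tss_index for m in TATA_PATTERN.finditer(region)]


# Synthetic promoter: GC-rich flanks with a TATA box starting 30 bp upstream.
promoter = "GC" * 20 + "TATAAAA" + "G" * 23 + "ATGGCC"
print(find_tata_boxes(promoter, tss_index=70))
```

A TATA- promoter returns an empty list from this kind of scan, which is one reason such promoters need other signals (for example GC content) to be recognised.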


Nucleotide mutations outside the protein coding regions but in regions that influence gene expression are called regulatory mutations (Knight, 2005). These are usually promoter region mutations and cause variations in gene expression levels (a regulatory SNP, rSNP, finding tool: rSNP Guide). Variants in the flanking sequences of the insulin and collagen type I genes, for example, have been linked to type I diabetes mellitus and osteoporosis, respectively. Some promoters contain neither TATA boxes nor other alternative promoter elements, and transcription is often initiated at multiple sites over a defined region. These promoters are often characterized by a GC-rich stretch of nucleotides (GC box) within 100 bp of the initiation site (see Ioshikhes & Zhang, 2000; Antequera, 2003 and Htf islands in glossary). Approximately 50-70% of genes contain GC boxes in their promoters and this feature can be used to predict promoter sequences via bioinformatics methods (Ponger & Mouchiroud, 2002 'full text'; Wang & Leung, 2004). Known regulatory units (promoters, enhancers, silencers) have been compiled in a single database called the Transcription Regulatory Regions Database (TRRD; Kolchanov, 2002).
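Two simple sequence statistics that GC-based promoter prediction commonly relies on are the GC fraction and the ratio of observed to expected CpG dinucleotides, where the expected count is (number of C x number of G) / sequence length. The sketch below computes both; the function names are hypothetical, and no thresholds are applied because the cut-offs used by the cited methods are not given in this text.

```python
def gc_fraction(seq: str) -> float:
    """Fraction of bases that are G or C (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def cpg_observed_expected(seq: str) -> float:
    """Observed CpG count divided by the count expected from base composition."""
    seq = seq.upper()
    c_count, g_count = seq.count("C"), seq.count("G")
    if c_count == 0 or g_count == 0:
        return 0.0
    observed = seq.count("CG")
    expected = c_count * g_count / len(seq)
    return observed / expected


print(gc_fraction("CGCGCGCG"), cpg_observed_expected("CGCGCGCG"))
print(gc_fraction("ATATATAT"), cpg_observed_expected("ATATATAT"))
```

In practice these values are computed in sliding windows along the sequence, and windows with high GC fraction and high observed/expected ratio are flagged as candidate promoter regions.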


Promoters function to orient RNA polymerase so that the correct strand of DNA is used as the template for transcription. In principle, any region of the DNA double helix could be copied into two different RNA molecules. The promoter of a gene determines which of the two strands is copied. A promoter is an oriented DNA sequence that points the RNA polymerase in one direction, and this orientation determines which DNA strand is copied. The anatomy of a promoter is usually defined by a combination of gene transfer experiments to assess the effects of promoter mutants and studies of protein–DNA interactions such as DNAse I footprinting and the electrophoretic gel mobility shift assay (EMSA) (see below).


Eukaryotic RNA Polymerases (RNAP): As opposed to the single bacterial RNAP, eukaryotes have three different RNAPs (I, II and III), each with its own promoters. They add one nucleotide at a time to the 3' end of the growing RNA molecule using ribonucleoside triphosphates (rNTPs) as substrates (this reaction releases pyrophosphate). RNAP I is dedicated to the synthesis of only one type of RNA molecule (pre-rRNA; rRNA is encoded by tandem repeats of genes in the nucleolus). RNAP III produces small RNAs such as all tRNAs, 5S rRNA and a number of other small RNAs. RNAP II is the enzyme principally involved in the transcription of protein-coding genes into RNA. Unlike DNA polymerase, RNA polymerase does not require a preformed primer to initiate the synthesis of RNA. Instead, RNAP is recruited and oriented by promoters. Unlike RNAP I and III, RNAP II recognises many thousands of promoters. Many of these promoters have TATA and CAAT boxes. Promoters of genes for housekeeping proteins typically contain multiple copies of a GC-rich element that includes the sequence 5'-GGGCGG-3' (GC box). Transcription by RNAP II is also affected by more distant enhancers.


The immediate product of transcription by RNA polymerase II is heterogeneous nuclear RNA (hnRNA), the primary transcript. hnRNA has a very wide range of sizes (2-40 kb). The RNA copy of the DNA that is to become protein is the processed product of hnRNA, termed messenger RNA (mRNA). Not all transcribed RNA is destined to arrive in the cytoplasm as mRNA. Modifications of primary RNA transcripts include splicing, cleavage, base modification, capping and the addition of poly-A tails. During RNA processing introns are spliced out, but sometimes exons are spliced out as well. This is called alternative splicing and results in different protein products (isoforms) from the same DNA sequence (more than one possible exon assembly). Alternative splicing (along with overlapping genes and other mechanisms) is one reason why fewer genes than originally estimated have been identified in the human genome.


Around 98% of all transcriptional output in humans is non-coding RNA. RNA-mediated gene regulation is widespread in higher eukaryotes and complex genetic phenomena like RNA interference, co-suppression, transgene silencing, imprinting and methylation involve some form of RNA signalling. It is possible that intronic and other non-coding RNAs have evolved to comprise a second tier of gene expression in eukaryotes (Mattick, 2001). Trans-acting RNAs may relay information required for the coordination and modulation of gene expression via chromatin remodelling, RNA–DNA, RNA–RNA and RNA–protein interactions.


A first step in the activation of transcription is the decondensation of chromatin surrounding a gene so that transcription factors and RNA polymerase can gain access to DNA.  Conversely, genes are often turned off by the condensation of chromatin surrounding a particular gene, and the inhibition of RNA polymerase and transcription factor binding. A large number of DNA binding proteins, collectively called transcription factors, regulate cell and tissue specificity of gene expression through their influence on RNA polymerase II-mediated transcription. The sequences to which transcription factors bind are called response elements. There are different classes of transcription factors. The general transcription factors include proteins that are involved in the assembly of the basal transcription apparatus. The sequence specific transcription factors include proteins that bind to DNA regulatory elements in the enhancers or promoters of genes. There are also coactivators and corepressors that bind to other transcription factors and regulate transcription by altering chromatin structure or by making contacts with the basal transcriptional machinery. A transcription factor database on the Internet (TransFac; Wingender, 2000) contains a comprehensive list of transcription factors and allows a promoter region to be searched for recognition sites for sequence-specific transcription factors. Another related internet resource is TFSearch which helps to find transcription factor binding sites in a sequence. Most transcription factors share motif structures such as zinc finger, helix-turn-helix (HTH), helix-loop-helix (HLH), leucine zipper, homeodomain and POU domain (see also Structural Motifs in Eukaryotic Transcription Factors).


An ever-increasing number of diseases are attributed to mutations in transcription factor genes. In general, germline mutations in transcription factor genes result in malformation syndromes, and somatic mutations involving many of the same genes contribute to carcinogenesis. Some of the better known transcription factors are: c-JUN, c-MYC, CBP, CREB, E2F1, STAT, SRY, PIT-1, ETS-1, RUNX-2/AML-3, GATA-1-6, PBX2, RFXAP and nuclear factor kappa-B (NFKB). Coactivators and corepressors also take part in regulating the activity of transcription factors. Diseases caused by mutations in transcription factors include Rubinstein-Taybi syndrome, vitamin D-resistant rickets type IIA, bare lymphocyte syndrome type II, acute lymphoblastic leukemia, acute myeloid leukemia and Down’s syndrome-associated acute megakaryoblastic leukemia. Transcription factors linked to cell cycle control, RB1 and P53, play major roles in neoplastic development. They are involved in retinoblastoma, osteogenic sarcoma, Li-Fraumeni syndrome and other familial cancer syndromes (Levine, 1991). Changes in transcription factor activity are also associated with neoplastic diseases. In many cancers, translocations combine different transcription factor domains or bring a transcriptional activation region under the control of a heterologous gene. One example is the t(1;19)(q23;p13) translocation in pre-B-cell acute lymphoblastic leukaemia, which fuses the PBX homeodomain gene to the transactivation domain of the E2A transcription factor gene.


Transcription factors also regulate gene expression by their effect on histone acetylation. Histone acetylation may create a more open chromatin configuration that allows other transcription factors to gain access to the regulatory regions of a gene and increases gene transcription. The transcriptional coactivator, CREB-binding protein (CBP), possesses intrinsic histone acetyltransferase (HAT) activity. Other transcription factors (often repressors) recruit histone deacetylases.


Transcription initiation (cap) site: This is where the transcription of DNA into immature (precursor) pre-mRNA starts. It is the 5' end of the transcribed sequence, that is, the beginning of the first exon. A 7-methylguanosine (7mG) cap is added to the 5' end of the mRNA at this position (to protect it against 5'-exonuclease activity). From here to the translation initiation site, the sequence corresponds to the 5' untranslated region (UTR) (also called the ribosome-binding region); the signal peptide, where present, is encoded immediately after the translation initiation site. The 5' UTR is not featured in the final protein product. It contains the site (the leader sequence) at which ribosomes initially bind to mRNA to start translation. The signal sequence is encoded in the first or second exon and translated at the N-terminus. It provides the signal for correct cellular location (endoplasmic reticulum, Golgi apparatus, cell membrane, etc) or for export out of the cell through the cell membrane, and is finally cleaved off by a signal peptidase. Mutations interfering with cleavage of the signal peptide result in human diseases such as factor X deficiency (due to a substitution of arginine by glycine at position -20, numbering the alanine at the NH2-terminus of the mature protein as +1) and autosomal recessive hypoparathyroidism (due to a mutation substituting serine with proline at position -3 in the signal peptide of the prepro-parathyroid hormone gene).


The 5’ UTRs of most mRNAs contain a consensus sequence of 5’-CCAGCCAUG-3’ involved in the initiation of protein synthesis. The events that occur to mRNA before it leaves the nucleus are collectively called RNA processing or post-transcriptional modification (capping, cleavage, base modification, polyadenylation and splicing). See a review on computational detection and location of transcription start sites by Down & Hubbard, 2002 ('full text').


Translation initiation site (ATG / AUG): This sequence represents the beginning (N-terminus) of translated proteins. (The 5' end of the coding sequence codes for the N-terminus of a polypeptide.) It codes for a methionine, but this methionine is frequently removed after translation. Thus, the first codon translated in each mature mRNA is for methionine (AUG), but not all mature polypeptides start with methionine. Translation takes place on ribosomes in the cytoplasm. According to the official Human Gene Nomenclature rules, the 'A' nucleotide of the ATG is nucleotide number +1, and all other sequence variation should be numbered using this nucleotide as reference. The nucleotide 5' to +1 is numbered -1; there is no base 0 (see Nomenclature Page). The sequences prior to the translation initiation site comprise the 5' untranslated region (5' UTR), and the most 5' base within the 5' UTR constitutes the transcriptional start site (designated as +1 in the separate, transcription-based numbering used for promoter elements). The 5' UTR varies greatly in length in different genes, and it may contain several exons. For this reason, it can be challenging to identify the location of the promoter, even when the coding region of the gene has been found.
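The numbering rule (A of the ATG is +1, the base before it is -1, and there is no base 0) is easy to get wrong by one when converting from array indices. A minimal sketch, assuming a contiguous cDNA-like sequence with no introns and a hypothetical function name:

```python
def coding_position(index: int, atg_index: int) -> int:
    """Convert a 0-based sequence index into ATG-based numbering.

    The A of the ATG (at atg_index) is +1; the preceding base is -1.
    Position 0 does not exist in this numbering scheme.
    """
    offset = index - atg_index
    return offset + 1 if offset >= 0 else offset


atg_index = 10  # made-up location of the A of the ATG in a sequence
for index in (8, 9, 10, 11, 12):
    print(index, coding_position(index, atg_index))
```

The printed positions run -2, -1, +1, +2, +3, skipping 0 as the nomenclature requires.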


Exon-intron boundaries: Nearly every intron starts with GpT (5' splice site / donor) and ends with ApG (3' splice site / acceptor). The introns are spliced out during post-transcriptional modification. The highly conserved intronic 5' GT and 3' AG sequences are essential for correct splicing. These are called splice sites, and mutations of these nucleotides cause aberrant splicing. U1 and U2 small nuclear RNAs bind to conserved sequences in introns (the 5' splice site and the branch site, respectively) to mediate splicing out. A number of diseases are caused by splice site mutations, one of which is congenital adrenal hyperplasia (CAH) caused by an intron 2 splice site mutation (Higashi, 1988) (many other mutations also cause CAH). Many beta thalassaemia disorders are caused by splice site mutations. An autosomal dominant form of isolated growth hormone deficiency is caused by mutations in intron 3 of the GH1 gene that cause exon 3 skipping, resulting in truncated products of the GH1 gene that prevent secretion of normal GH. Although the introns are not represented in the resultant polypeptide, they may contain some regulatory sequences. An example of this is intron 35 of the MHC class III gene C4 (complement component 4), which contains promoter activity for the neighbouring gene CYP21A2 (21-hydroxylase).
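The GT-AG rule and the effect of a splice site mutation can be illustrated with a short sketch. The mini-gene, the coordinates and the function names are made up for illustration; real splice site recognition also depends on the branch site and surrounding sequence, which are not modelled.

```python
def follows_gt_ag_rule(seq: str, intron_start: int, intron_end: int) -> bool:
    """Check that seq[intron_start:intron_end] begins with GT and ends with AG."""
    intron = seq[intron_start:intron_end].upper()
    return len(intron) >= 4 and intron.startswith("GT") and intron.endswith("AG")


def splice(seq: str, introns: list) -> str:
    """Remove the given (start, end) half-open intron intervals and join the exons."""
    exons, cursor = [], 0
    for start, end in sorted(introns):
        exons.append(seq[cursor:start])
        cursor = end
    exons.append(seq[cursor:])
    return "".join(exons)


# Hypothetical mini-gene: exon 1, one intron (GT...AG), exon 2.
gene = "ATGGCC" + "GTAAGTTTTCAG" + "AAATAA"
intron = (6, 18)

print(follows_gt_ag_rule(gene, *intron))   # donor and acceptor both intact
print(splice(gene, [intron]))              # exons joined

mutant = gene[:6] + "GA" + gene[8:]        # donor GT mutated to GA
print(follows_gt_ag_rule(mutant, *intron)) # donor site lost
```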


Stop codon: One of the three stop codons (UAA, UGA and UAG) provides the termination signal for translation. (The presence of three different stop codons is an example of degeneracy of the genetic code.) The triplet before the stop codon codes for the last amino acid of a polypeptide chain (C-terminus). (The 3' end of the coding sequence codes for the C-terminus of a polypeptide.)


Polyadenylation signal: The poly-A signal (usually 5’-AAUAAA-3’) lies downstream of the stop codon (within the 3’ UTR) and signals for the addition of a poly-A tail of variable length as a post-transcriptional modification of mRNA. It is located 10-30 nucleotides upstream of the 3’ cleavage site. The poly-A tail is believed to stimulate translation initiation, whereas its shortening triggers entry of mRNA into the decay pathway. Besides its functional importance in maturation of mRNA as well as its transport, degradation and translation, the poly-A tail is also useful in preferential extraction of mRNA from total RNA. This approach, however, requires care because only about 90% of mRNAs have poly-A tails (histone mRNAs, for example, lack them).
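The positional constraint described above (the hexamer 10-30 nucleotides upstream of the cleavage site) can be turned into a simple check. The sketch below makes some assumptions that are not in the text: distance is measured from the end of the hexamer to the cleavage site, only the canonical AAUAAA hexamer is searched, and the sequence and function name are hypothetical.

```python
def polya_signals(utr3: str, cleavage_index: int,
                  min_dist: int = 10, max_dist: int = 30,
                  signal: str = "AAUAAA") -> list:
    """Return start indices of signals lying min_dist..max_dist nt upstream
    of the cleavage site (distance measured from the end of the hexamer)."""
    utr3 = utr3.upper()
    hits = []
    pos = utr3.find(signal)
    while pos != -1:
        distance = cleavage_index - (pos + len(signal))
        if min_dist <= distance <= max_dist:
            hits.append(pos)
        pos = utr3.find(signal, pos + 1)
    return hits


# Made-up 3' UTR: hexamer at index 10, cleavage site 20 nt after its end.
utr = "GC" * 5 + "AAUAAA" + "C" * 20 + "GG"
print(polya_signals(utr, cleavage_index=36))
```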


Untranslated Regions (UTRs): Transcription often terminates 0.5-2 kb downstream of the poly-A signal, at a position determined by transcription termination signals. The sequence from the stop codon (the termination signal for translation) to the end of the mature transcript is called the 3' UTR. The 5' UTR usually contains gene- or developmental stage-specific and common regulators of expression (motifs, boxes, response or binding elements), and the 3' UTR is also involved in gene expression although it does not contain well-known transcription control sites. 3' UTR sequences (such as cytoplasmic polyadenylation elements or adenylation control elements) can control the nuclear export, polyadenylation status, subcellular targeting, and rates of translation and degradation of mRNA (see Decker & Parker, 1995; Pesole, 2001 for reviews). Many mRNAs with short half-lives contain a 50-nucleotide AU-rich sequence (reiterations of AUUUA) in the 3’ UTR (AU-rich elements; 3'AURE). Removal or alteration of this sequence prolongs the half-life of mRNA. The presence of this sequence may be a feature of genes that can rapidly alter their expression level. 3'AURE is found in the genes encoding cytokines, adhesion molecules and protooncogenes, and may be a marker of mRNAs that are inducible by environmental stressors (Asson-Batres, 1994). The involvement of the 3' UTR is well documented in controlling male and female gametogenesis and in early embryonic development. Myotonic dystrophy is a disease caused by the expansion of triplet repeats in the 3' UTR of a protein kinase gene, DMPK. Genetic variation in the 3' UTR of NF1 affects expression levels (Cowley, 1998).
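Counting AUUUA pentamers gives a crude first indication of whether a 3' UTR carries AU-rich elements. The sketch below counts overlapping occurrences (so AUUUAUUUA counts as two) and accepts DNA or RNA letters; the function name and example sequences are hypothetical, and no cut-off for calling an AU-rich element is implied.

```python
import re


def count_auuua(utr3: str) -> int:
    """Count overlapping AUUUA pentamers in a 3' UTR given as RNA or DNA."""
    rna = utr3.upper().replace("T", "U")
    # A zero-width lookahead lets matches overlap.
    return len(re.findall(r"(?=AUUUA)", rna))


print(count_auuua("GGCAUUUAUUUAUUUACC"))  # three overlapping repeats
print(count_auuua("GGCCGGCCGG"))          # none
```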


A difference in phenotype that is dependent on the position of a gene or a group of genes is called position effect. This is often due to the presence of heterochromatin nearby. The change in a gene's location may cause a change in its expression (a problem that has to be overcome in gene therapy). 


In summary, the stages of protein synthesis are: transcription, RNA processing, translation and post-translational modifications (such as glycosylation, deamidation, acetylation, hydroxylation, sulfation, lipidation, methylation or phosphorylation; including removal of the N-terminal methionine in most proteins). Such post-translational changes in the molecule may play a role in disease pathogenesis without any further change in the DNA sequence. An example is a haemoglobin variant in which lysine at beta-82 is replaced by asparagine (N), part of which is then deamidated after translation to aspartic acid (D) (Charache, 1977). Several other post-translational deamidations in the haemoglobin molecule have been reported (see OMIM 141900). In coeliac disease, the deamidation of gamma-gliadin creates an epitope that acts as the antigen in the initiation of this autoimmune disease (Molberg, 1998). Deacetylation of lysine or methylation of lysine and arginine in histone molecules can result in gene silencing similar to methylation of CpG sequences in the DNA (histone molecules can also undergo ubiquitination or phosphorylation as post-translational modifications).


An example of gene expression regulation

Cellular iron uptake and storage are regulated through a feedback control mechanism mediated at the post-transcriptional level by cytoplasmic factors known as iron-regulatory proteins 1 and 2. These proteins sense levels of iron in the transit pool. When iron in this pool is scarce, they bind to stem-loop structures known as iron-responsive elements in the 5' untranslated region of the ferritin mRNA and the 3' untranslated region of the transferrin receptor mRNA. Such binding inhibits translation of ferritin mRNA and stabilizes the mRNA for transferrin receptors. The opposite scenario develops when iron in the transit pool is plentiful (Ponka, 1998). An interesting example of the regulation of gene expression is random allelic inactivation. This occurs in autosomal genes (Rhoades, 2000) and is similar to X-inactivation or parental imprinting. Genes that show random allelic inactivation include olfactory receptor genes and the various genes encoding antigen receptors on lymphocytes (immunoglobulin genes, T cell receptor genes and NK receptor genes; see NK Cell Receptors).
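The iron feedback loop described above amounts to a two-state switch, which can be written down as a toy model. This is only a restatement of the logic in the paragraph, not a quantitative model; the function name and the wording of the returned states are hypothetical.

```python
def irp_response(iron_is_scarce: bool) -> dict:
    """Qualitative outcome of iron-regulatory protein (IRP) binding.

    IRPs bind iron-responsive elements (IREs) only when iron is scarce:
    binding in the ferritin 5' UTR blocks translation, while binding in the
    transferrin receptor 3' UTR protects that mRNA from degradation.
    """
    irp_bound = iron_is_scarce
    return {
        "irp_bound_to_ire": irp_bound,
        "ferritin_translation": "inhibited" if irp_bound else "active",
        "transferrin_receptor_mrna": "stabilised" if irp_bound else "degraded",
    }


print(irp_response(iron_is_scarce=True))   # less storage, more uptake
print(irp_response(iron_is_scarce=False))  # more storage, less uptake
```

The same protein-RNA interaction has opposite effects on the two mRNAs because of where the element sits: in the 5' UTR it interferes with translation initiation, in the 3' UTR it affects mRNA stability.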


Epigenetic effects on gene expression

Genetic phenomena that affect gene expression without any DNA nucleotide change include paramutation (an allelic interaction that results in meiotically heritable changes in gene expression in plants), methylation, genomic imprinting, allelic exclusion and histone modification / the histone code hypothesis (see glossary). Such epigenetic changes, and especially their heritability, have been among the most hotly debated topics of recent years (see Hidden Inheritance by G Vines in New Scientist, 28 Nov 1998, pp.27-30; Epigenetics: Special Issue of Science, 2001; Rando, 2007). The National Fragile X Foundation website explains the molecular basis of a well-known epigenetic disease, fragile X syndrome. This disease is due to CGG triplet repeat expansion, which triggers methylation of CpGs in the repeat tract and subsequently silencing of the FMR1 gene promoter (Coffee, 2002). For a long time, epigenetic regulation of gene activity was synonymous with DNA methylation. It is now better appreciated that epigenetic changes depend not only on DNA methylation but also on a number of covalent histone modifications. There is even a link between DNA methylation and histone modification, resulting in chromatin remodelling with subsequent alteration of gene activity. All of these interactions are collectively called epigenetic crosstalk (for a review, see Weissmann & Lyko, 2003). DNA methylation has a direct effect on gene expression by modulating the interaction between transcription factors and DNA, but it also exerts indirect effects on the regulation of gene expression through chromatin modifications. Transcriptionally inactive heterochromatin is packed densely whereas active euchromatin is less condensed. Acetylation, phosphorylation and methylation of histone proteins affect gene expression by inducing changes in the state of chromatin. For example, histone deacetylation and methylation of histone H3 at lysine 9 (H3-K9) mark (inactive) heterochromatin, which hinders transcription factor access to binding sites in the regulatory regions of genes (see histone in glossary; histone acetylation and chromatin remodelling).


Common techniques to measure gene expression and regulation of gene expression are Northern blotting, S1 ribonuclease protection assay, DNA footprinting, electrophoretic mobility shift assay (EMSA), quantitative reverse-transcription (RT) and real-time PCR. These are briefly described below (see also the biotechnology section). For further details, visit the Gene Quantification Page by MW Pfaffl and Comparison of Different Methods of mRNA Quantitation (Ambion).


The following features of eukaryotic genomes can complicate bioinformatic studies of ‘gene finding’:

1. The genes are not located end-to-end but instead separated by long stretches of intergenic ‘junk’ DNA (low gene density),

2. Alternative splicing creates unexpected complexity in the products of the same gene when using mRNA fragments to predict genes,

3. Genes nested within each other or overlapping genes (complex gene structure),

4. Pseudogenes.

See Computational Gene Recognition Programs for bioinformatics resources.


Analysis of Gene Expression

(See mRNA Transcript Analysis in Cancer Medicine e5 Online and Comparison of Different Methods of mRNA Quantitation-Ambion.)


Northern blot analysis: Northern blotting or transfer is a variation of the Southern technique, differing in that mRNA is the target material for hybridisation and there is no need for restriction endonuclease digestion because mRNA is in the size range that is appropriate for electrophoresis and transfer. Because secondary structure can lead to aberrant electrophoretic behaviour, RNA is electrophoretically separated by size in the presence of a denaturing agent (usually formaldehyde or glyoxal/DMSO). After electrophoresis through a denaturing agarose gel, the RNA is transferred to a nitrocellulose or nylon-based membrane as in Southern blotting. Hybridisation schemes and blot washing are essentially the same for Northern blotting as for Southern blotting. The amount of RNA in the sample is estimated from the intensity of the autoradiographic signal. Thus, Northern analysis can provide quantitative and qualitative information on steady-state mRNA transcripts and can identify the presence of amplified gene expression.


RNA can be isolated from cells in its intact form, free from DNA (DNase treatment may be necessary to eliminate any contaminating DNA). As in all RNA-related work, special precautions must be taken in extracting and manipulating the RNA to prevent RNases from destroying the molecule (see RNA Isolation: The Basics (Ambion), Northern Analysis: The Basics (Ambion) and Northern blot movie).


There is a limit to the sensitivity of Northern blotting; only moderately abundant mRNAs can be detected using this technique. The sensitivity can be increased by using pure mRNA rather than total RNA. mRNA makes up less than 10% of the total RNA content of a cell or tissue. The rest is hnRNA, ribosomal and transfer RNA. Most mRNAs destined for the cytoplasm and translation are modified by the addition of a 3’ poly-A tail. This property can be used to enrich the RNA preparation for mRNA species by removing all RNA molecules that lack the 3’ poly-A tail. In this way, the sensitivity of Northern blotting is improved by nearly two orders of magnitude. Not all mRNAs, however, have poly-A tails (histone mRNAs, for example, lack them), and this should be taken into account in the analysis. More sensitive methods for the analysis of mRNA, together with biochemical methods used to identify proteins that bind to the different elements in eukaryotic promoters, are outlined below:


RNase Protection Assay (RPA): Another technique used in the analysis of mRNA is the nuclease S1 (RNase) protection assay (RPA). This assay is more sensitive than Northern blotting and is therefore used for the detection of rare mRNA species. It also provides detailed structural information about the mRNA being analysed, and is thus referred to as transcript mapping. The basic method involves the protection of double-stranded hybrids from the degrading action of nucleases. A hybrid is formed between mRNA and a complementary DNA probe. The annealed mixture is subjected to digestion with an enzyme specific for single-stranded nucleic acids (usually S1 nuclease). Such enzymes specifically degrade single-stranded DNA or RNA and leave double-stranded structures intact. In other words, areas in the probe that anneal to the mRNA are protected from digestion by the nucleases. The end-products are then analysed by polyacrylamide gel electrophoresis. The extent to which the RNA protects the radiolabelled DNA probe from degradation indicates the structure of the RNA. The amount of radiolabelled probe resistant to digestion is proportional to the amount of target mRNA in the sample.

Since the nucleases that digest the annealed probe/mRNA hybrid are specific for single-stranded nucleotides, any mismatches between probe and target are susceptible to digestion. A mismatch can be detected if the nuclease-digested radiolabelled probe is smaller than expected, or when the probe has been digested into multiple fragments. In fact, by careful measurement of the length of the digested probe, the position of the mismatch in the target mRNA can be determined. Because conditions for annealing RNA to DNA are highly selective, S1 nuclease protection analysis is very sensitive. See also Nuclease Protection Assays: The Basics (Ambion) and RPA movie.
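The way fragment lengths reveal a mismatch can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the probe and mRNA are given already aligned position by position (probe written 3' to 5', mRNA 5' to 3'), and the nuclease is assumed to cut at every unpaired position, which is an idealisation of real digestion. The function name and the example sequences are hypothetical.

```python
# DNA probe base -> RNA base it pairs with
PAIRS_WITH = {"A": "U", "C": "G", "G": "C", "T": "A"}


def protected_fragments(probe_dna, mrna_aligned):
    """Return (start, length) of each probe stretch that stays double-stranded.

    Position i of the probe is assumed to face position i of the mRNA.
    Any unpaired position is treated as a nuclease cut site (idealised).
    """
    if len(probe_dna) != len(mrna_aligned):
        raise ValueError("probe and mRNA must be aligned to the same length")
    fragments = []
    start = None
    for i, (p, r) in enumerate(zip(probe_dna, mrna_aligned)):
        paired = PAIRS_WITH.get(p) == r
        if paired and start is None:
            start = i                      # a protected stretch begins
        elif not paired and start is not None:
            fragments.append((start, i - start))
            start = None                   # stretch ended at a mismatch
    if start is not None:
        fragments.append((start, len(probe_dna) - start))
    return fragments


# Fully complementary hybrid: one full-length protected fragment
print(protected_fragments("TACGGATC", "AUGCCUAG"))
# Mismatch at position 3: two shorter fragments whose lengths locate it
print(protected_fragments("TACGGATC", "AUGACUAG"))
```

In this toy model, a single full-length fragment corresponds to a perfectly matched transcript, while two fragments of lengths 3 and 4 place the mismatch at position 3 of the probe.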


DNase I Footprinting Assay: DNase I footprinting assays are based on the principle that any protein that is bound to a DNA fragment will protect the DNA from digestion by DNase I. It is used to determine the sites at which proteins bind to DNA. In experiments of this type, a DNA fragment is radiolabelled at one end. The labelled DNA is incubated with the DNA binding protein of interest and then subjected to partial digestion with DNase I. The DNA regions interacting with the protein can be identified by comparison of the digestion products of the protein-bound DNA with those resulting from identical DNase treatment of a parallel sample of DNA that was not incubated with protein (or incubated with another protein). The region of DNA that is protected by the DNA binding protein will appear as a gap (footprint) in the sequence of DNase I digestion pattern.
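The lane comparison described above can be illustrated with a short Python sketch. It assumes that band intensities have already been measured for each nucleotide position in both lanes; the function name, the 0.3 ratio threshold and the minimum footprint length are hypothetical choices for illustration, not values from any published protocol.

```python
def find_footprints(control, bound, threshold=0.3, min_length=3):
    """Return (first, last) positions of regions protected from DNase I.

    control: band intensities from DNA digested without protein
    bound:   band intensities from DNA digested after protein binding
    A position counts as protected when bound/control falls below threshold.
    """
    if len(control) != len(bound):
        raise ValueError("lanes must cover the same positions")
    regions = []
    start = None
    for i, (c, b) in enumerate(zip(control, bound)):
        protected = c > 0 and b / c < threshold
        if protected and start is None:
            start = i
        elif not protected and start is not None:
            if i - start >= min_length:     # ignore isolated weak bands
                regions.append((start, i - 1))
            start = None
    if start is not None and len(control) - start >= min_length:
        regions.append((start, len(control) - 1))
    return regions


control_lane = [10, 10, 10, 10, 10, 10, 10, 10, 10, 10]
protein_lane = [10, 9, 10, 1, 2, 1, 1, 10, 11, 9]
print(find_footprints(control_lane, protein_lane))
```

With these made-up intensities, positions 3 to 6 are reported as the footprint. Requiring a minimum run length is a design choice that keeps a single faint band from being called a binding site.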


DNA footprinting can also be employed in vivo: permeabilised cells or intact nuclei are exposed to DNase I before isolation of DNA. In vitro and in vivo DNA footprinting assays often give different results. This suggests that the interactions or proteins involved may be different in the intact cell in comparison to naked DNA that is mixed with nuclear protein extracts.


Footprinting assays are able to analyse only 150-200 bp DNA segments in each assay. The number and nature of proteins bound remain unknown unless specific proteins are used in isolation in the assay. The sensitivity is low; the assay can only detect interactions with abundant proteins. Because DNase I can still cleave several base pairs away from the protein binding site, the resolution of the binding site is relatively low.


The electrophoretic mobility shift assay (EMSA): EMSA (also called gel-shift or band-shift assay) is more useful than the footprinting assay for quantitative analysis of DNA-protein binding reactions. It measures the ability of purified proteins to bind to radiolabelled DNA fragments (usually 10-50 bp). In this assay, the electrophoretic mobility of a radiolabelled oligonucleotide is determined in the presence and absence of a sequence-specific DNA-binding protein. The samples are electrophoresed under non-denaturing conditions. Protein binding generally reduces the mobility of a DNA fragment, causing a shift in the position of the band detected by autoradiography after non-denaturing polyacrylamide gel electrophoresis. Free DNA probe that is not bound to protein will migrate the fastest. Any probe that is bound to protein will be retarded, and will move more slowly through the gel. This technique is very useful for determining the exact nucleotides necessary for protein binding by using mutant oligonucleotides. See EMSA at Molecular Biology of the Cell.
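As a rough illustration of how a gel shift can be read quantitatively, the sketch below computes the fraction of probe bound in each lane from the free and shifted band intensities, and interpolates the protein concentration at which half of the probe is shifted. The function names and numbers are hypothetical, and treating the half-shift point as an apparent binding constant is an assumption that holds only when protein is in large excess over probe.

```python
def fraction_bound(free, shifted):
    """Fraction of labelled probe found in the retarded (protein-bound) band."""
    total = free + shifted
    if total <= 0:
        raise ValueError("no signal in lane")
    return shifted / total


def half_saturation(concentrations, fractions):
    """Linearly interpolate the concentration where fraction bound crosses 0.5.

    Returns None if the titration never crosses 0.5 between two lanes.
    """
    points = list(zip(concentrations, fractions))
    for (c0, f0), (c1, f1) in zip(points, points[1:]):
        if f0 < 0.5 <= f1:
            return c0 + (0.5 - f0) * (c1 - c0) / (f1 - f0)
    return None


# Hypothetical titration: (free, shifted) band intensities per lane
lanes = [(75, 25), (25, 75)]
protein_nM = [1.0, 10.0]
bound = [fraction_bound(f, s) for f, s in lanes]
print(bound, half_saturation(protein_nM, bound))
```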


Western Blot Analysis: This is an electrophoretic blotting technique used to detect and analyse proteins. Protein is subjected to electrophoresis on polyacrylamide gels, which often contain a detergent, to separate the molecules by their molecular weight. The separated protein is then transferred to a filter using high-voltage electrophoresis in a similar way to other blotting techniques. Enzyme-linked or isotopically labelled antibodies that are species-specific are applied to the filter. The antibodies bind to the proteins they are specific for, and the protein-antibody bands can then be visualised by either autoradiographic or colorimetric methods.


Serial Analysis of Gene Expression (SAGE): Serial analysis of gene expression (SAGE) is a method for comprehensive analysis of gene expression patterns. In SAGE, the investigator sequences a small and unique fragment of each expressed gene (called a SAGE tag) and quantifies the number of times it appears (called the SAGE tag number). The SAGE tag numbers, therefore, directly reflect the abundance of the corresponding transcript. Unlike DNA chip analysis, SAGE is able to detect and quantify the expression of previously uncharacterised genes. The results are equivalent to those obtained from constructing a cDNA library from the tissue of interest and sequencing every clone.

SAGE methodology can be summarised by the following three principles: (1) a short sequence tag (10-17 bp) contains sufficient information to uniquely identify a transcript; (2) sequence tags can be linked together to form long serial molecules that can be cloned and sequenced; (3) quantitation of the number of times a particular tag is observed provides the expression level of the corresponding transcript. As an example of the second principle, multiple 10-base-pair SAGE tags can be packaged in a single plasmid, which reduces the number of plasmid preparations and DNA sequencing reactions required to analyse a large number of genes. A single sequencing reaction can provide information on 30 to 35 different SAGE tags, and therefore 30 to 35 different genes (see also SAGE for Beginners by EMBL; Applications of SAGE).
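The tag-counting idea can be sketched in Python as follows. This is a simplified illustration, not a replacement for dedicated SAGE software. It assumes that the concatemer is built from ditags separated by the anchoring enzyme site CATG, that each tag is the 10 bp next to that site, and that the second tag of each ditag is read from the opposite strand. Sequencing errors, duplicate ditags and tags that themselves contain CATG are ignored. The function names and example tags are hypothetical.

```python
from collections import Counter

ANCHOR = "CATG"   # assumed anchoring enzyme site separating ditags
TAG_LEN = 10      # assumed tag length in base pairs


def revcomp(seq):
    """Reverse complement of a DNA sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def tags_from_concatemer(read):
    """Split a concatemer at anchor sites and pull two tags from each ditag."""
    tags = []
    for ditag in read.split(ANCHOR):
        if len(ditag) < 2 * TAG_LEN:
            continue                              # too short to hold two tags
        tags.append(ditag[:TAG_LEN])              # tag on the forward strand
        tags.append(revcomp(ditag[-TAG_LEN:]))    # tag on the reverse strand
    return tags


def count_tags(reads):
    """Tally how often each tag is seen across all sequenced concatemers."""
    counts = Counter()
    for read in reads:
        counts.update(tags_from_concatemer(read))
    return counts


tag_a, tag_b = "AAACCCGGGA", "TTGACGTACA"
read = (ANCHOR + tag_a + revcomp(tag_b) +
        ANCHOR + tag_a + revcomp(tag_a) + ANCHOR)
print(count_tags([read]))
```

In this toy read the first tag is seen three times and the second once, so the first transcript would be taken as about three times as abundant. Matching each tag to a gene would require a separate tag-to-gene lookup, which is not shown.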


RNA Differential Display Analysis: This method is used to compare two cell populations. mRNA is isolated from both populations. Reverse transcription and PCR are performed using a poly-T primer, which will anneal to the poly-A tail of mRNA, and a set of random hexamer primers, which by chance will anneal to sequences upstream of the poly-A tail in mRNA. Since the upstream primer will anneal at random to different mRNA species, the lengths of the PCR products will vary for nearly every mRNA. The amplification is performed in the presence of radiolabelled nucleotides so that the products from the two reactions can be visualised on a high-resolution polyacrylamide gel. Bands that are much darker in one lane than in the other represent mRNA species that are overexpressed in one cell population relative to the other. A band may also be completely absent from one of the populations examined. The cDNA representing such bands can be recovered from the gel for further analysis and identification.


DNA Microarray Analysis: Comparative gene expression profiling and genome-wide analysis of gene expression can be achieved by DNA microarrays (DNA chips). Either 25-nucleotide long fragments of known DNA sequences (oligonucleotide arrays for sequence variation studies) or cDNA fragments (cDNA arrays for expression profile studies) are immobilised on glass surfaces on a 1.3 cm x 1.3 cm microarray in a predetermined order (grid). Thousands of fragments can be stored on a single chip. The sample of interest (tumour, tissue, species) to be examined for gene expression profile should be available in a form that will allow RNA extraction. The RNA is labelled with a fluorescent dye and hybridised to the fragments on the microarray. Hybridisation events are captured by scanning the surface of the microarray with a laser scanning device and measuring the fluorescence intensity at each position in the microarray. The fluorescence intensity of each spot on the array is proportional to the level of expression of the gene represented by that spot. DNA microarrays have been used to understand the cell cycle, haematopoietic differentiation, interferon gamma treatment and cancer classification. The ability to monitor the expression levels of thousands of genes simultaneously offers the opportunity to expand the analysis of cancer genetics beyond single-candidate gene approaches. Microarrays are capable of monitoring the expression levels of the entire human genome using nanograms of total RNA.

The challenge, however, is the interpretation of the microarray data. The key is to develop methods for recognizing meaningful gene expression patterns and distinguishing those patterns from noise. Such noise (random gene expression levels) can be generated by (1) variability among microarrays, (2) variability in RNA labelling and hybridisation methods, and perhaps most importantly, (3) biological variability among samples. It has become clear that the successful elucidation of genetic networks through expression profiling will require the expertise of a new generation of scientists (computational biologists). (See also Ambion Guide for Array Analysis; DNA Microarray Web site; Software AMIADA (Analyzing Microarray Data) by X Xia.)
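One simple way of reducing the labelling and hybridisation variability mentioned above is sketched below in Python. It assumes a two-sample comparison in which spot intensities are available for a test sample and a reference sample, and it centres the log2 ratios on their median. This median centring rests on the assumption that most genes do not change between the two samples; it is one common normalisation choice among many, not the method of any particular software package. The function names, gene labels and the two-fold cutoff are hypothetical.

```python
import math
from statistics import median


def normalised_log_ratios(sample, reference):
    """log2(sample/reference) per gene, shifted so the median ratio is zero."""
    raw = {}
    for gene, s in sample.items():
        r = reference.get(gene)
        if r is None or s <= 0 or r <= 0:
            continue                     # skip missing or unusable spots
        raw[gene] = math.log2(s / r)
    if not raw:
        return {}
    centre = median(raw.values())        # global shift from labelling bias
    return {gene: value - centre for gene, value in raw.items()}


def changed_genes(log_ratios, cutoff=1.0):
    """Genes whose normalised log2 ratio is at least `cutoff` in magnitude."""
    return sorted(g for g, v in log_ratios.items() if abs(v) >= cutoff)


tumour = {"g1": 200.0, "g2": 200.0, "g3": 800.0}
normal = {"g1": 100.0, "g2": 100.0, "g3": 100.0}
ratios = normalised_log_ratios(tumour, normal)
print(ratios, changed_genes(ratios))
```

In this example every spot is at least twice as bright in the tumour channel, but after centring only g3 stands out, which is the behaviour expected if the uniform two-fold difference reflects labelling efficiency rather than biology. Biological variability among samples is not addressed by this step and would need replicate arrays.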


Real-time PCR is another method used to analyse the expression of one or several specific genes quantitatively. This method first requires the conversion of mRNA to cDNA by reverse transcription. The rest is based on the same principles as conventional PCR, with important modifications that allow quantitation. See real-time PCR and an Ambion article on real-time and quantitative RT-PCR.
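The text above does not specify how quantitation is carried out. As one illustration, the sketch below uses the widely used comparative threshold cycle (2 to the power of minus delta-delta Ct) calculation, which is an assumption of this example rather than something described in this article. It further assumes that the target and reference genes both amplify with close to 100% efficiency, so that each cycle doubles the product. The function name and Ct values are hypothetical.

```python
def relative_expression(ct_target_sample, ct_ref_sample,
                        ct_target_control, ct_ref_control):
    """Fold change of a target gene in a sample relative to a control.

    Each Ct is the cycle at which fluorescence crosses the threshold.
    The reference gene corrects for differences in input cDNA amount.
    Assumes perfect doubling per cycle for both genes.
    """
    delta_sample = ct_target_sample - ct_ref_sample
    delta_control = ct_target_control - ct_ref_control
    return 2 ** -(delta_sample - delta_control)


# Target crosses threshold 2 cycles earlier in the sample (after
# normalising to the reference gene), i.e. about 4-fold more transcript
print(relative_expression(22, 18, 24, 18))
```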



Molecular Biology Links

Human Gene Expression in Human Molecular Genetics Online (Strachan & Read)

Gene Transcription chapter in the Molecular Biology Web Book

Eukaryotic Gene Expression Tutorial in the Biology Project

Molecular Structure of Genes and Chromosomes in Molecular Cell Biology

Regulation of Transcription Initiation in Molecular Cell Biology

Control of Gene Expression in the Medical Biochemistry Page

DNA Learning Center   DNA Interactive - Manipulation   Genome - Applications

Biology Animations, Movies and Tutorials   Transcription & Translation

DNA Down Under

Gene Quantification Page by MW Pfaffl

Differential Gene Expression Techniques

Animations of Transcription and Translation at Genetics (BA Pierce) Website

WH Freeman Biology Books Companion Sites   (Lodish 5e)

Biology 7/e (Raven et al): Online Learning Center: Online Labs: Chapter 15 Animations 

Critical Reviews™ in Eukaryotic Gene Expression

Gene Expression: mRNA Transcript Analysis & Protein Analysis in Cancer Medicine e5 Online

NIH WebCasts: Current Topics in Genome Analysis & Genome Analysis

  Genomics & Proteomics Links


BioInformatics Links

TransFac   TFSearch  (Transcription Factors and Binding Sites)

TRRD Database of Transcription Regulatory Regions of Eukaryotic Genes

Splice Predictor Online   GenSCAN   PromoSer   FIEv2   Other Gene Prediction Programs 

DNALC Bioinformatics in the Classroom Course Notes: Identifying Genes in DNA Sequences

Bibliography on Computational Gene Recognition 

Rogic et al. Evaluation of Gene-Finding Programs. Genome Res 2001

Mathe et al. Current methods of gene prediction, their strengths and weaknesses. NAR 2002


Statistical Analysis of Gene Expression by Terry Speed (Berkeley) (PPT)

Bassett et al. Gene Expression Informatics (review). Nat Genet 1999

RT-PCR Primer Bank   RT-PCR Primer DataBase   RT-PCR Primer Sets

Quantitative PCR Primer Database - QPPD (NCI)


BioInformatics Slide Presentation I & II

BMC BioInformatics   BMC Genomics

Bioinformatics: A Practical Guide to the Analysis of Genes and Proteins (Wiley, 2001)

Bioinformatics and Genome Analysis (Springer-Verlag, 2002)

Handbook of Statistical Genetics (Wiley, 2003), see TOC (includes a chapter on Gene Prediction)


Science Magazine Gene Expression Link

Science Magazine Genes in Action Special Issue (22 Oct 2004)




Last updated on 14 July 2013

