Bioinformatics

Strategies Used in Sequencing Projects

The strategies used for sequencing genomes include the Sanger method, shotgun sequencing, pairwise-end sequencing, and next-generation sequencing.

Learning Objectives

Compare the different strategies used for whole-genome sequencing: the Sanger method, shotgun sequencing, pairwise-end sequencing, and next-generation sequencing

Key Takeaways

Key Points

  • The Sanger method is a basic sequencing technique that uses fluorescently-labeled dideoxynucleotides (ddNTPs) during DNA replication which results in multiple short strands of replicated DNA that terminate at different points, based on where the ddNTP was incorporated.
  • Shotgun sequencing is a method that randomly cuts DNA fragments into smaller pieces and then, with the help of a computer, takes the DNA fragments, analyzes them for overlapping sequences, and reassembles the entire DNA sequence.
  • Pairwise-end sequencing is a type of shotgun sequencing which is used for larger genomes and analyzes both ends of the DNA fragments for overlap.
  • Next-generation sequencing is a type of sequencing which is automated and relies on sophisticated software for rapid DNA sequencing.

Key Terms

  • fluorophore: a molecule or functional group which is capable of fluorescence
  • contig: a set of overlapping DNA segments, derived from a single source of genetic material, from which the complete sequence may be deduced
  • dideoxynucleotide: any nucleotide formed from a deoxynucleotide by loss of a second hydroxyl group from the deoxyribose group

Strategies Used in Sequencing Projects

The basic sequencing technique used in all modern day sequencing projects is the chain termination method (also known as the dideoxy method), which was developed by Fred Sanger in the 1970s. The chain termination method involves DNA replication of a single-stranded template with the use of a primer and a regular deoxynucleotide (dNTP), which is a monomer, or a single unit, of DNA. The primer and dNTP are mixed with a small proportion of fluorescently-labeled dideoxynucleotides (ddNTPs). The ddNTPs are monomers that are missing a hydroxyl group (–OH) at the site at which another nucleotide usually attaches to form a chain. Each ddNTP is labeled with a different color of fluorophore. Every time a ddNTP is incorporated in the growing complementary strand, it terminates the process of DNA replication, which results in multiple short strands of replicated DNA that are each terminated at a different point during replication. When the reaction mixture is processed by gel electrophoresis after being separated into single strands, the multiple, newly-replicated DNA strands form a ladder due to their differing sizes. Because the ddNTPs are fluorescently labeled, each band on the gel reflects the size of the DNA strand and the ddNTP that terminated the reaction. The different colors of the fluorophore-labeled ddNTPs help identify the ddNTP incorporated at that position. Reading the gel on the basis of the color of each band on the ladder produces the sequence of the template strand.

image

Sanger’s Method: Frederick Sanger’s dideoxy chain termination method uses dideoxynucleotides, in which the DNA fragment can be terminated at different points. The DNA is separated on the basis of size, and these bands, based on the size of the fragments, can be read.

image

Structure of a Dideoxynucleotide: A dideoxynucleotide is similar in structure to a deoxynucleotide, but is missing the 3′ hydroxyl group (indicated by the box). When a dideoxynucleotide is incorporated into a DNA strand, DNA synthesis stops.
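The logic of reading a chain-termination ladder can be sketched in a few lines of Python. This is a toy model under stated assumptions, not a description of any real instrument: the template string, the function names, and the idea of representing each fragment as a (length, terminal base) pair are all illustrative. The read-out is the newly synthesized strand, from which the template follows by base pairing.

```python
# Map each template base to the base that pairs with it.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

def synthesized_strand(template_3_to_5):
    """Return the new strand (5'->3') copied from a template written 3'->5'."""
    return "".join(COMPLEMENT[base] for base in template_3_to_5)

def terminated_fragments(new_strand):
    """One fragment per position, as if a ddNTP stopped synthesis there.

    Each fragment is (length, terminal_base); the terminal base stands in
    for the color of the fluorophore carried by the ddNTP.
    """
    return [(i + 1, base) for i, base in enumerate(new_strand)]

def read_ladder(fragments):
    """Sort fragments by size, as electrophoresis does, and read the labels."""
    return "".join(base for _, base in sorted(fragments))

template = "TACGGTCA"  # toy template (assumption), written 3'->5'
new_strand = synthesized_strand(template)
fragments = terminated_fragments(new_strand)
print(read_ladder(fragments))  # sequence of the newly synthesized strand
```

Because every fragment carries both its length and its terminal label, the order in which fragments are produced does not matter; sorting by size recovers the sequence.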

Early Strategies: Shotgun Sequencing and Pairwise-End Sequencing

In the shotgun sequencing method, several copies of a DNA fragment are cut randomly into many smaller pieces (somewhat like what happens to a round shot cartridge when fired from a shotgun). All of the segments are then sequenced using the chain-termination method. Then, with the help of a computer, the fragments are analyzed to see where their sequences overlap. By matching overlapping sequences at the end of each fragment, the entire DNA sequence can be reconstructed. A larger sequence that is assembled from overlapping shorter sequences is called a contig. As an analogy, consider that someone has four copies of a landscape photograph that you have never seen before, so you know nothing about how it should appear. The person then rips up each photograph by hand, so that pieces of different sizes are produced from each copy. The person then mixes all of the pieces together and asks you to reconstruct the photograph. In one of the smaller pieces you see a mountain. In a larger piece, you see that the same mountain is behind a lake. A third fragment shows only the lake, but it reveals that there is a cabin on the shore of the lake. Therefore, from looking at the overlapping information in these three fragments, you know that the picture contains a mountain behind a lake that has a cabin on its shore. This is the principle behind reconstructing entire DNA sequences using shotgun sequencing.
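The overlap-and-merge idea can be illustrated with a small greedy assembler. This is a minimal sketch, assuming error-free reads from a single strand and no repeated regions; the function names, the `min_len` threshold, and the toy reads are illustrative choices, and real assemblers use far more elaborate methods.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b."""
    for size in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:size]):
            return size
    return 0

def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of reads with the largest overlap."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(reads) > 1:
        best = (0, None, None)
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    size = overlap(a, b, min_len)
                    if size > best[0]:
                        best = (size, i, j)
        size, i, j = best
        if size == 0:
            break  # no remaining overlaps: leave separate contigs
        merged = reads[i] + reads[j][size:]
        reads = [r for k, r in enumerate(reads) if k not in (i, j)] + [merged]
    return reads

reads = ["ATGGCGTAC", "GTACCTTAG", "TTAGGACCA"]  # toy reads (assumption)
contigs = greedy_assemble(reads)
print(contigs)
```

Each merge joins two reads that share an end, just as the mountain, lake, and cabin pieces are joined in the photograph analogy; the final list holds one contig per group of reads that could be connected.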

Originally, shotgun sequencing only analyzed one end of each fragment for overlaps. This was sufficient for sequencing small genomes. However, the desire to sequence larger genomes, such as that of a human, led to the development of double-barrel shotgun sequencing, more formally known as pairwise-end sequencing. In pairwise-end sequencing, both ends of each fragment are analyzed for overlap. Pairwise-end sequencing is, therefore, more cumbersome than shotgun sequencing, but it is easier to reconstruct the sequence because there is more available information.

Next-generation Sequencing

Since 2005, the automated sequencing techniques used by laboratories have fallen under the umbrella of next-generation sequencing, a group of automated techniques used for rapid DNA sequencing. These automated, low-cost sequencers can generate sequences of hundreds of thousands or millions of short fragments (25 to 500 base pairs) in the span of one day. Sophisticated software is used to manage the cumbersome process of putting all the fragments in order.

Annotating Genomes

Genome annotation is the identification and understanding of the genetic elements of a sequenced genome.

Learning Objectives

Define genome annotation

Key Takeaways

Key Points

  • Once a genome is sequenced, all of the sequences must be analyzed to understand what they mean.
  • Critical to annotation is the identification of the genes in a genome, the structure of the genes, and the proteins they encode.
  • Once a genome is annotated, further work is done to understand how all the annotated regions interact with each other.

Key Terms

  • BLAST: In bioinformatics, Basic Local Alignment Search Tool, or BLAST, is an algorithm for comparing primary biological sequence information, such as the amino-acid sequences of different proteins or the nucleotides of DNA sequences.
  • in silico: In computer simulation or in virtual reality

Genome projects are scientific endeavors that ultimately aim to determine the complete genome sequence of an organism (be it an animal, a plant, a fungus, a bacterium, an archaean, a protist, or a virus). They annotate protein-coding genes and other important genome-encoded features. The genome sequence of an organism includes the collective DNA sequences of each chromosome in the organism. For a bacterium containing a single chromosome, a genome project will aim to map the sequence of that chromosome.

Once a genome is sequenced, it needs to be annotated to make sense of it. An annotation (irrespective of the context) is a note added by way of explanation or commentary. Since the 1980s, molecular biology and bioinformatics have created the need for DNA annotation. DNA annotation or genome annotation is the process of identifying the locations of genes and all of the coding regions in a genome and determining what those genes do.

image

Genome Annotation: Here a small region of a genome is annotated, with various elements identified. The annotation of an entire genome would entail a similar in-depth analysis of thousands or even millions of such DNA sequences.

Genome annotation is the process of attaching biological information to sequences. It consists of two main steps: identifying elements on the genome, a process called gene prediction, and attaching biological information to these elements. Automatic annotation tools try to perform all of this by computer analysis, as opposed to manual annotation (a.k.a. curation) which involves human expertise. Ideally, these approaches co-exist and complement each other in the same annotation pipeline (process). The basic level of annotation is using BLAST for finding similarities, and then annotating genomes based on that. However, nowadays more and more additional information is added to the annotation platform. The additional information allows manual annotators to deconvolute discrepancies between genes that are given the same annotation. Some databases use genome context information, similarity scores, experimental data, and integrations of other resources to provide genome annotations through their Subsystems approach. Other databases rely on both curated data sources as well as a range of different software tools in their automated genome annotation pipeline.

Structural annotation consists of the identification of genomic elements: ORFs and their localization, gene structure, coding regions, and the location of regulatory motifs. Functional annotation consists of attaching biological information to genomic elements: biochemical function, biological function, involved regulation and interactions, and expression.
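As a rough illustration of structural annotation, the sketch below scans one strand of a DNA sequence for open reading frames (ORFs) in all three reading frames. It is a simplified example under stated assumptions: it uses the standard start codon ATG and the stop codons TAA, TAG, and TGA, ignores the reverse strand and introns, and the function name, `min_codons` parameter, and toy sequence are illustrative rather than taken from any annotation tool.

```python
START = "ATG"
STOPS = {"TAA", "TAG", "TGA"}

def find_orfs(dna, min_codons=2):
    """Return (start, end, frame) for ORFs on the given strand.

    start is the index of the A in ATG; end is one past the stop codon.
    min_codons counts codons before the stop, including the start codon.
    """
    orfs = []
    for frame in range(3):
        start = None
        for pos in range(frame, len(dna) - 2, 3):
            codon = dna[pos:pos + 3]
            if start is None and codon == START:
                start = pos  # open a candidate reading frame
            elif start is not None and codon in STOPS:
                if (pos - start) // 3 >= min_codons:
                    orfs.append((start, pos + 3, frame))
                start = None  # close it and look for the next start
    return orfs

dna = "CCATGGCTTGATAAATGAAACCCTAG"  # toy sequence (assumption)
for start, end, frame in find_orfs(dna):
    print(frame, start, end, dna[start:end])
```

The coordinates returned here are the kind of localization information that structural annotation records; functional annotation would then attach biological meaning to each region.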

These steps may involve both biological experiments and in silico analysis. Proteogenomics-based approaches utilize information from expressed proteins, often derived from mass spectrometry, to improve genomics annotations. A variety of software tools have been developed to permit scientists to view and share genome annotations. Genome annotation is the next major challenge for the Human Genome Project, now that the genome sequences of human and several model organisms are largely complete. Identifying the locations of genes and other genetic control elements is often described as defining the biological “parts list” for the assembly and normal operation of an organism. Scientists are still at an early stage in the process of delineating this parts list and in understanding how all the parts “fit together.”

Homologs, Orthologs, and Paralogs

Homology describes the relationship between genes and how they are inherited from ancestors.

Learning Objectives

Distinguish homologs, orthologs and paralogs

Key Takeaways

Key Points

  • A homologous gene (or homolog) is a gene inherited in two species from a common ancestor. While homologous genes can be similar in sequence, similar sequences are not necessarily homologous.
  • Orthologs are homologous genes that diverged after a speciation event, but in which the gene and its main function are conserved.
  • If a gene is duplicated in a species, the resulting duplicated genes are paralogs of each other, even though over time they might become different in sequence composition and function.

Key Terms

  • conserved: In biology, conserved sequences are similar or identical sequences that occur within nucleic acid sequences (such as RNA and DNA sequences), protein sequences, or protein structures.
  • selective pressure: Any cause that reduces reproductive success in a proportion of a population and so potentially exerts evolutionary pressure or selection pressure.

Homology forms the basis of organization for comparative biology. A homologous trait is often called a homolog (also spelled homologue). In genetics, the term “homolog” is used both to refer to a homologous protein and to the gene (DNA sequence) encoding it. As with anatomical structures, homology between protein or DNA sequences is defined in terms of shared ancestry. Two segments of DNA can have shared ancestry because of either a speciation event (orthologs) or a duplication event (paralogs). Homology among proteins or DNA is often incorrectly concluded on the basis of sequence similarity. The terms “percent homology” and “sequence similarity” are often used interchangeably. As with anatomical structures, high sequence similarity might occur because of convergent evolution, or, as with shorter sequences, because of chance. Such sequences are similar, but not homologous. Sequence regions that are homologous are also called conserved. This is not to be confused with conservation in amino acid sequences in which the amino acid at a specific position has been substituted with a different one with functionally equivalent physicochemical properties. One can, however, refer to partial homology where a fraction of the sequences compared (are presumed to) share descent, while the rest does not. For example, partial homology may result from a gene fusion event.

image

Example of Homologous DNA: This is the sequence alignment of a homologous protein from two different species. Each “*” represents a conserved amino acid in the two proteins.
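The “*” markers in an alignment like the one pictured can be computed directly from two aligned sequences. The following sketch assumes the sequences are already aligned to equal length with “-” for gaps; the two protein fragments are made-up toy data and the function names are illustrative. As noted above, the resulting percent identity measures similarity only and does not by itself establish homology.

```python
def conservation_line(seq_a, seq_b):
    """Mark identical aligned positions with '*' and others with a space."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    return "".join(
        "*" if a == b and a != "-" else " "
        for a, b in zip(seq_a, seq_b)
    )

def percent_identity(seq_a, seq_b):
    """Identical positions as a percentage of aligned columns."""
    marks = conservation_line(seq_a, seq_b)
    return 100.0 * marks.count("*") / len(marks)

# Toy, already-aligned protein fragments (assumption); '-' marks a gap.
species_1 = "MKTAYIAKQR"
species_2 = "MKSAY-AKQK"
print(species_1)
print(species_2)
print(conservation_line(species_1, species_2))
print(round(percent_identity(species_1, species_2), 1))
```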

Homologous sequences are orthologous if they were separated by a speciation event: when a species diverges into two separate species, the copies of a single gene in the two resulting species are said to be orthologous. Orthologs, or orthologous genes, are genes in different species that originated by vertical descent from a single gene of the last common ancestor. For instance, the plant Flu regulatory protein is present both in Arabidopsis (multicellular higher plant) and Chlamydomonas (single cell green algae). The Chlamydomonas version is more complex: it crosses the membrane twice rather than once, contains additional domains, and undergoes alternative splicing. However, it can fully substitute the much simpler Arabidopsis protein, if transferred from algae to plant genome by means of gene engineering. Significant sequence similarity and shared functional domains indicate that these two genes are orthologous genes, inherited from the shared ancestor. Orthologous sequences provide useful information in taxonomic classification and phylogenetic studies of organisms. The pattern of genetic divergence can be used to trace the relatedness of organisms. Two organisms that are very closely related are likely to display very similar DNA sequences between two orthologs. Conversely, an organism that is further removed evolutionarily from another organism is likely to display a greater divergence in the sequence of the orthologs being studied.

Homologous sequences are paralogous if they were separated by a gene duplication event: if a gene in an organism is duplicated to occupy two different positions in the same genome, then the two copies are paralogous. Paralogous genes often belong to the same species, but this is not necessary. For example, the hemoglobin gene of humans and the myoglobin gene of chimpanzees are paralogs. Paralogs can be split into in-paralogs (paralogous pairs that arose after a speciation event) and out-paralogs (paralogous pairs that arose before a speciation event). Between-species out-paralogs are pairs of paralogs that exist between two organisms due to duplication before speciation. Within-species out-paralogs are pairs of paralogs that exist in the same organism, but whose duplication event happened before speciation. Paralogs typically have the same or similar function, but sometimes do not. Due to lack of the original selective pressure upon one copy of the duplicated gene, this copy is free to mutate and acquire new functions. Paralogous sequences provide useful insight into the way genomes evolve. The genes encoding myoglobin and hemoglobin are considered to be ancient paralogs. Similarly, the four known classes of hemoglobins (hemoglobin A, hemoglobin A2, hemoglobin B, and hemoglobin F) are paralogs of each other. While each of these proteins serves the same basic function of oxygen transport, they have already diverged slightly in function: fetal hemoglobin (hemoglobin F) has a higher affinity for oxygen than adult hemoglobin. However, function is not always conserved. Human angiogenin diverged from ribonuclease, for example, and while the two paralogs remain similar in tertiary structure, their functions within the cell are now quite different.

Synthesizing DNA

DNA can be synthesized chemically for a number of purposes.

Learning Objectives

Outline the methods and uses of DNA synthesis

Key Takeaways

Key Points

  • DNA and RNA are at their essence chemical structures, and as such complex chemical reactions can be used to synthesize them.
  • There are enzymatic ways to amplify DNA, notably PCR, while DNA sequences can be chemically synthesized by a process known as oligosynthesis.
  • Oligosynthesis can be used to make artificial genes, which allows scientists to design and synthesize novel gene products without relying on a template of a gene found in nature.

Key Terms

  • phosphoramidite: Any of a class of organic compounds formally derived from a phosphite by replacing a >P-O-R with a >P-N<R2 group; used in the synthesis of nucleic acids, etc.
  • HPLC: High-performance liquid chromatography (sometimes referred to as high-pressure liquid chromatography), HPLC, is a chromatographic technique used to separate a mixture of compounds in analytical chemistry and biochemistry with the purpose of identifying, quantifying and purifying the individual components of the mixture.

To understand bacterial genetics, the underlying genetic material (i.e. DNA) must be understood. DNA must be synthesized to study genes, to sequence genomes, and for many other purposes. This is done in two ways: enzymatically, by the polymerase chain reaction (PCR), and by chemical synthesis. PCR is covered in a separate section. Here we will focus on chemical synthesis of DNA, which is also known as oligonucleotide synthesis.

Oligonucleotide synthesis is the chemical synthesis of relatively short fragments of nucleic acids, both DNA and RNA, with a defined chemical structure (sequence). The technique is extremely useful in current laboratory practice because it provides rapid and inexpensive access to custom-made oligonucleotides of the desired sequence. Whereas enzymes synthesize DNA and RNA in a 5′ to 3′ direction, chemical oligonucleotide synthesis is carried out in the opposite, 3′ to 5′ direction.

image

Oligosynthesis: The complex chemical reactions that are needed to couple one nucleotide to another are outlined here.

Currently, the process is implemented as solid-phase synthesis using the phosphoramidite method and phosphoramidite building blocks derived from protected 2′-deoxynucleosides (dA, dC, dG, and T), ribonucleosides (A, C, G, and U), or chemically modified nucleosides, e.g. LNA. To obtain the desired oligonucleotide, the building blocks are sequentially coupled to the growing oligonucleotide chain in the order required by the sequence of the product. The process has been fully automated since the late 1970s. Upon the completion of the chain assembly, the product is released from the solid phase to solution, deprotected, and collected. The occurrence of side reactions sets practical limits for the length of synthetic oligonucleotides (up to about 200 nucleotide residues) because the number of errors accumulates with the length of the oligonucleotide being synthesized. Products are often isolated by HPLC to obtain the desired oligonucleotides in high purity. Typically, synthetic oligonucleotides are single-stranded DNA or RNA molecules around 15–25 bases in length. Oligonucleotides find a variety of applications in molecular biology and medicine. They are most commonly used as antisense oligonucleotides, small interfering RNA, primers for DNA sequencing and amplification, probes for detecting complementary DNA or RNA via molecular hybridization, tools for the targeted introduction of mutations and restriction sites, and building blocks for the synthesis of artificial genes.
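Why errors limit oligonucleotide length can be seen with a short calculation. If each coupling step succeeds with some fixed probability, the fraction of chains that reach full length falls geometrically with the number of couplings. The sketch below is illustrative only: the per-step coupling efficiency of 0.99 is an assumed value, not a figure from this text, and the function name is hypothetical.

```python
def full_length_yield(length, coupling_efficiency=0.99):
    """Fraction of chains that reach full length after length-1 couplings.

    The first nucleoside is already attached to the solid support, so an
    oligonucleotide of `length` residues needs `length - 1` coupling steps.
    """
    if length < 1:
        raise ValueError("length must be at least 1")
    return coupling_efficiency ** (length - 1)

# Compare a typical primer length with much longer products.
for n in (20, 100, 200):
    print(n, round(full_length_yield(n), 3))
```

Under this assumption, most chains of primer length are complete, while only a small fraction of 200-residue chains are, which is consistent with the practical limit and the need for HPLC purification described above.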

A further application of oligosynthesis is to make artificial genes. Artificial gene synthesis is the process of synthesizing a gene in vitro without the need for initial template DNA samples. The main method is currently by oligonucleotide synthesis (also used for other applications) from digital genetic sequences and subsequent annealing of the resultant fragments. In contrast, natural DNA replication requires existing DNA templates for synthesizing new DNA.

Amplifying DNA: The Polymerase Chain Reaction

The polymerase chain reaction (PCR) is a method by which DNA is amplified.

Learning Objectives

Illustrate the applications, components and steps of PCR

Key Takeaways

Key Points

  • PCR is used to amplify a specific region of DNA.
  • PCR typically consists of three steps: denaturation, annealing, and elongation.
  • The amplified DNA can be used for many purposes, such as identifying different genes and species of bacteria.

Key Terms

  • annealing: Annealing, in genetics, means for complementary sequences of single-stranded DNA or RNA to pair by hydrogen bonds to form a double-stranded polynucleotide. The term is often used to describe the binding of a DNA probe, or the binding of a primer to a DNA strand during a polymerase chain reaction (PCR). The term is also often used to describe the reformation (renaturation) of complementary strands that were separated by heat (thermally denatured). Proteins such as RAD52 can help DNA anneal.

The polymerase chain reaction (PCR) is a biochemical technology in molecular biology used to amplify a single, or a few copies, of a piece of DNA across several orders of magnitude, generating thousands to millions of copies of a particular DNA sequence.

Applications

Developed in 1983 by Kary Mullis, PCR is now a common and often indispensable technique used in medical and biological research labs for a variety of applications including the following:

  • DNA cloning for sequencing; DNA-based phylogeny, or functional analysis of genes
  • The diagnosis of hereditary diseases
  • The identification of genetic fingerprints (used in forensic sciences and paternity testing)
  • The detection and diagnosis of infectious diseases

The method relies on thermal cycling, consisting of cycles of repeated heating and cooling of the reaction for DNA melting and enzymatic replication of the DNA. Primers (short DNA fragments) containing sequences complementary to the target region, along with a DNA polymerase (after which the method is named) are key components to enable selective and repeated amplification. As PCR progresses, the DNA generated is itself used as a template for replication, setting in motion a chain reaction in which the DNA template is exponentially amplified. PCR can be extensively modified to perform a wide array of genetic manipulations.

Components

PCR is used to amplify a specific region of a DNA strand (the DNA target). Most PCR methods typically amplify DNA fragments of up to ~10 kilo base pairs (kb), although some techniques allow for amplification of fragments up to 40 kb in size. The reaction produces a limited amount of final amplified product that is governed by the available reagents in the reaction, and the feedback-inhibition of the reaction products. A basic PCR set up requires the following components and reagents:

  • DNA template that contains the DNA region (target) to be amplified
  • Two primers that are complementary to the 3′ (three prime) ends of each of the sense and anti-sense strand of the DNA target
  • Taq polymerase or another DNA polymerase with a temperature optimum at around 70 °C
  • Deoxynucleoside triphosphates (dNTPs; nucleotides containing triphosphate groups), the building-blocks from which the DNA polymerase synthesizes a new DNA strand
  • Buffer solution, providing a suitable chemical environment for optimum activity and stability of the DNA polymerase
  • Divalent cations, magnesium or manganese ions; generally Mg2+
  • Monovalent cation potassium ions

Typically, PCR consists of a series of 20-40 repeated temperature changes, called cycles, with each cycle commonly consisting of two to three discrete temperature steps, usually three. The temperatures used, and the length of time they are applied in each cycle, depend on a variety of parameters. These include the enzyme used for DNA synthesis, the concentration of divalent ions and dNTPs in the reaction, and the melting temperature (Tm) of the primers.
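The relationship between a target, its two primers, and their melting temperatures can be sketched as follows. The forward primer copies the 5′ end of the sense strand and the reverse primer is the reverse complement of its 3′ end, so each primer anneals to the 3′ end of one strand. The melting temperature uses the Wallace rule, Tm ≈ 2 °C × (A+T) + 4 °C × (G+C), a rough approximation suited to short oligonucleotides that is brought in here as an assumption; the function names, primer length, and toy target are likewise illustrative.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Complement every base, then reverse to keep 5'->3' orientation."""
    return seq.translate(COMPLEMENT)[::-1]

def design_primers(target, length=18):
    """Return (forward, reverse) primers flanking a sense-strand target."""
    forward = target[:length]                       # anneals to antisense strand
    reverse = reverse_complement(target[-length:])  # anneals to sense strand
    return forward, reverse

def wallace_tm(primer):
    """Approximate melting temperature in degrees C (Wallace rule)."""
    at = primer.count("A") + primer.count("T")
    gc = primer.count("G") + primer.count("C")
    return 2 * at + 4 * gc

# Toy target written 5'->3' on the sense strand (assumption).
target = "ATGGCGTACCTTAGGACCATTGACCGTAAGCTTGGATCCGA"
forward, reverse = design_primers(target)
print(forward, wallace_tm(forward))
print(reverse, wallace_tm(reverse))
```

In practice the annealing temperature is usually set a few degrees below the lower of the two primer melting temperatures, which is why Tm appears among the parameters that determine the cycle temperatures.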

Steps

The following are the steps of PCR:

image

The Steps of PCR: This illustrates a PCR reaction to demonstrate how amplification leads to the exponential growth of a short product flanked by the primers. 1. Denaturing at 96°C. 2. Annealing at 68°C. 3. Elongation at 72°C. The first cycle is complete. The two resulting DNA strands make up the template DNA for the next cycle, thus doubling the amount of DNA duplicated for each new cycle.

  1. Denaturation step: This step is the first regular cycling event and consists of heating the reaction to 94-98°C. It causes DNA melting of the DNA template by disrupting the hydrogen bonds between complementary bases, yielding single-stranded DNA molecules.
  2. Annealing step: The reaction temperature is lowered to 50-65°C for 20-40 seconds allowing annealing of the primers to the single-stranded DNA template.
  3. Extension/elongation step: The temperature at this step depends on the DNA polymerase used; Taq polymerase has its optimum activity temperature at 75-80°C, and commonly a temperature of 72°C is used with this enzyme. At this step the DNA polymerase synthesizes a new DNA strand complementary to the DNA template strand by adding dNTPs that are complementary to the template in 5′ to 3′ direction, condensing the 5′-phosphate group of the dNTPs with the 3′-hydroxyl group at the end of the nascent (extending) DNA strand.

After elongation, the cycle goes back to step one, usually for 20-40 cycles. Under optimum conditions (i.e., if there are no limitations due to limiting substrates or reagents) at each extension step, the amount of DNA target is doubled, leading to exponential (geometric) amplification of the specific DNA fragment.
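The doubling described above amounts to N = N0 × 2^n copies after n cycles under ideal conditions. A minimal sketch of that arithmetic follows; the `efficiency` parameter (the fraction of templates copied per cycle) is an added assumption used to show how less-than-ideal reactions fall short of perfect doubling, and the function name is illustrative.

```python
def copies_after(cycles, starting_copies=1, efficiency=1.0):
    """Copies of the target after a number of cycles.

    efficiency=1.0 means every template is copied in every cycle
    (perfect doubling); lower values model a less efficient reaction.
    """
    return starting_copies * (1 + efficiency) ** cycles

for n in (10, 20, 30):
    print(n, copies_after(n))
```

This simple model ignores the plateau that sets in when reagents become limiting, so it overestimates the yield of late cycles.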

DNA Sequencing Based on Sanger Dideoxynucleotides

Sanger sequencing is based on the incorporation and detection of labeled ddNTPs as terminal nucleotides in DNA amplification.

Learning Objectives

Recall dideoxynucleotide sequencing

Key Takeaways

Key Points

  • A ddNTP lacks the 3′ hydroxyl group of a dNTP, which stops the incorporation of further nucleotides. This termination creates DNA fragments that stop at every nucleotide position, which is central to identifying each nucleotide.
  • Different labels can be used: ddNTPs, dNTPs, and primers can all be labeled radioactively or fluorescently.
  • Using fluorescent labels, dideoxy sequencing can be automated allowing high-throughput methods which have been utilized to sequence entire genomes.

Key Terms

  • chromatogram: The visual output from a chromatograph. Usually a graphical display or histogram.
  • dideoxynucleotide: Any nucleotide formed from a deoxynucleotide by loss of a second hydroxy group from the deoxyribose group

Sanger sequencing, also known as chain-termination sequencing, refers to a method of DNA sequencing developed by Frederick Sanger in 1977. This method is based on amplification of the DNA fragment to be sequenced by DNA polymerase and incorporation of modified nucleotides – specifically, dideoxynucleotides (ddNTPs).

The classical chain-termination method requires a single-stranded DNA template, a DNA primer, a DNA polymerase, normal deoxynucleotide triphosphates (dNTPs), and modified nucleotides (dideoxyNTPs) that terminate DNA strand elongation. These chain-terminating nucleotides lack a 3′-OH group required for the formation of a phosphodiester bond between two nucleotides, causing DNA polymerase to cease extension of DNA when a ddNTP is incorporated. The ddNTPs may be radioactively or fluorescently labelled for detection in automated sequencing machines. The DNA sample is divided into four separate sequencing reactions, containing all four of the standard deoxynucleotides (dATP, dGTP, dCTP and dTTP) and the DNA polymerase. To each reaction is added only one of the four dideoxynucleotides (ddATP, ddGTP, ddCTP, or ddTTP). Following rounds of template DNA extension from the bound primer, the resulting DNA fragments are heat denatured and separated by size using gel electrophoresis. This is frequently performed using a denaturing polyacrylamide-urea gel with each of the four reactions run in one of four individual lanes (lanes A, T, G, C). The DNA bands may then be visualized by autoradiography or UV light and the DNA sequence can be directly read off the X-ray film or gel image.

image

Sanger sequencing: Different types of Sanger sequencing, all of which depend on the sequence being stopped by a terminating dideoxynucleotide (black bars).

Technical variations of chain-termination sequencing include tagging with nucleotides containing radioactive phosphorus for radiolabelling, or using a primer labeled at the 5′ end with a fluorescent dye. Dye-primer sequencing facilitates reading in an optical system for faster and more economical analysis and automation. The later development by Leroy Hood and coworkers of fluorescently labeled ddNTPs and primers set the stage for automated, high-throughput DNA sequencing. Chain-termination methods have greatly simplified DNA sequencing. More recently, dye-terminator sequencing has been developed. Dye-terminator sequencing utilizes labelling of the chain terminator ddNTPs, which permits sequencing in a single reaction, rather than four reactions as in the labelled-primer method. In dye-terminator sequencing, each of the four dideoxynucleotide chain terminators is labelled with fluorescent dyes, each of which emit light at different wavelengths.

image

Chromatogram: This is an example of the output of a Sanger sequencing read using fluorescently labelled dye-terminators. The four DNA bases are represented by different colours, which are interpreted by the software to give the DNA sequence above.

Automated DNA-sequencing instruments (DNA sequencers) can sequence up to 384 DNA samples in a single batch (run) in up to 24 runs a day. DNA sequencers carry out capillary electrophoresis for size separation, detection and recording of dye fluorescence, and data output as fluorescent peak trace chromatograms. Automation has led to the sequencing of entire genomes.
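How software might turn a peak trace into a sequence can be illustrated with a very simple base caller: at each peak, report the base whose dye channel is brightest. This is a toy sketch under stated assumptions, not the algorithm used by any real sequencer; the intensity values are invented, peaks are assumed to be already detected and ordered by fragment size, and the function name is illustrative.

```python
CHANNELS = "ACGT"

def call_bases(peaks):
    """Call one base per peak by picking the strongest dye channel.

    peaks is a list of dicts mapping each base to the fluorescence
    intensity recorded in that base's channel at one peak position.
    """
    return "".join(max(CHANNELS, key=lambda base: peak[base]) for peak in peaks)

# Toy trace (assumption): one dict of channel intensities per peak.
peaks = [
    {"A": 30, "C": 25, "G": 880, "T": 40},
    {"A": 910, "C": 35, "G": 20, "T": 50},
    {"A": 45, "C": 30, "G": 25, "T": 860},
    {"A": 20, "C": 790, "G": 60, "T": 35},
]
print(call_bases(peaks))
```

Real base callers also weigh peak shape, spacing, and background signal, and report a quality score for each call.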

Metagenomics

Metagenomics is the study of genetic material derived from environmental samples.

Learning Objectives

Summarize the utility of metagenomics

Key Takeaways

Key Points

  • While previous work needed cultivation of single microbes before they could be sequenced and identified, metagenomics attempts to more completely identify many of the microbes that inhabit a given environmental location.
  • The first attempts at metagenomics were to sequence one gene from a sample. The changes in that one gene helped determine the microbial diversity in a sample.
  • High-throughput sequencing allows the complete sequencing and assembly of entire genomes of the microbes that inhabit a given environment, giving unprecedented depth into understanding the microbial diversity of the world around us.

Key Terms

  • SOLiD: SOLiD (Sequencing by Oligonucleotide Ligation and Detection) is a next-generation DNA sequencing technology developed by Life Technologies that has been commercially available since 2008. This next-generation technology generates hundreds of millions to billions of small sequence reads at one time.
  • gigabase: One billion bases (nucleotides) as a unit of length of a nucleic acid
  • pyrosequencing: A technique used to sequence DNA using chemiluminescent enzymatic reactions

Metagenomics

Metagenomics is the study of metagenomes, the genetic material recovered directly from environmental samples. The broad field may also be referred to as environmental genomics, ecogenomics, or community genomics. While traditional microbiology, microbial genome sequencing, and genomics rely upon cultivated clonal cultures, early environmental gene sequencing cloned specific genes (often the 16S rRNA gene) to produce a profile of diversity in a natural sample. Such work revealed that the vast majority of microbial biodiversity had been missed by cultivation-based methods. Recent studies use “shotgun” Sanger sequencing or massively parallel pyrosequencing to get largely unbiased samples of all genes from all the members of the sampled communities. Because it can reveal the previously hidden diversity of microscopic life, metagenomics offers a powerful lens for viewing the microbial world and has the potential to revolutionize understanding of the entire living world.

Conventional Sequencing Studies

Conventional sequencing begins with a culture of identical cells as a source of DNA. However, early metagenomic studies revealed that many environments probably contain large groups of microorganisms that cannot be cultured and thus cannot be sequenced in this way. These early studies focused on 16S ribosomal RNA sequences, which are relatively short, often conserved within a species, and generally different between species. Many 16S rRNA sequences have been found that do not belong to any known cultured species, indicating that there are numerous non-isolated organisms. These surveys of ribosomal RNA (rRNA) genes taken directly from the environment revealed that cultivation-based methods find less than 1% of the bacterial and archaeal species in a sample.

Shotgun Metagenomics

Advances in bioinformatics, refinements of DNA amplification, and the proliferation of computational power have greatly aided the analysis of DNA sequences recovered from environmental samples. These developments allow shotgun sequencing to be adapted to metagenomic samples. The approach, used to sequence many cultured microorganisms and the human genome, randomly shears DNA, sequences many short fragments, and reconstructs them into a consensus sequence. Shotgun sequencing and screens of clone libraries reveal genes present in environmental samples. This provides information both on which organisms are present and on what metabolic processes are possible in the community. This can be helpful in understanding the ecology of a community, particularly if multiple samples are compared to each other.
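
The reconstruction step can be illustrated with a toy greedy overlap assembler. This is only a sketch under simplifying assumptions (error-free reads, a single strand, no repeats); the function names and the example reads are invented for illustration, and production assemblers use far more sophisticated methods.

```python
# Toy greedy assembler: repeatedly merge the two reads with the longest
# suffix/prefix overlap until no overlaps of at least min_len remain.

def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def greedy_assemble(reads, min_len=3):
    reads = list(reads)
    while len(reads) > 1:
        best_k, best_i, best_j = 0, None, None
        # Find the ordered pair of distinct reads with the largest overlap.
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # no remaining overlaps; the result is several contigs
        merged = reads[best_i] + reads[best_j][best_k:]
        reads = [r for idx, r in enumerate(reads) if idx not in (best_i, best_j)]
        reads.append(merged)
    return reads

# Hypothetical reads sheared from the sequence ATGGCGTACGTTAGC.
reads = ["TACGTTAGC", "ATGGCGTA", "GCGTACGT"]
contigs = greedy_assemble(reads)
print(contigs)  # ['ATGGCGTACGTTAGC']
```

When reads from different organisms share no sufficient overlap, the loop stops early and returns several separate contigs, which is the usual situation for a mixed environmental sample.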

image

Environmental Shotgun Sequencing (ESS): (A) sampling from habitat; (B) filtering particles, typically by size; (C) lysis and DNA extraction; (D) cloning and library construction; (E) sequencing the clones; (F) sequence assembly into contigs and scaffolds.

Shotgun metagenomics is also capable of sequencing nearly complete microbial genomes directly from the environment. As the collection of DNA from an environment is largely uncontrolled, the most abundant organisms in an environmental sample are most highly represented in the resulting sequence data. To achieve the high coverage needed to fully resolve the genomes of under-represented community members, large samples are needed. On the other hand, the random nature of shotgun sequencing ensures that many of these organisms, which would otherwise go unnoticed using traditional culturing techniques, will be represented by at least some small sequence segments.
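
The relationship between abundance and coverage can be sketched with a simple calculation. It assumes that reads are drawn from each genome in proportion to its share of the DNA in the sample and that read starts are uniformly random, so the covered fraction follows a Poisson approximation. The function names and all numbers are hypothetical.

```python
import math

def expected_coverage(total_reads, read_length, genome_size, abundance):
    """Mean coverage of one genome, assuming it receives a share of the
    reads equal to its abundance (fraction of DNA in the sample)."""
    return total_reads * abundance * read_length / genome_size

def fraction_covered(coverage):
    """Poisson approximation: fraction of bases hit by at least one read."""
    return 1.0 - math.exp(-coverage)

# Hypothetical run: one million 400 bp reads, two 4 Mb genomes.
abundant = expected_coverage(1_000_000, 400, 4_000_000, abundance=0.5)
rare = expected_coverage(1_000_000, 400, 4_000_000, abundance=0.001)

print(f"abundant member: {abundant:.1f}x coverage")
print(f"rare member: {rare:.2f}x coverage, "
      f"{fraction_covered(rare):.1%} of bases sampled")
```

Under these assumptions the abundant genome is covered about 50-fold, while the rare one reaches only about 0.1-fold coverage, so fewer than 10% of its bases are expected to appear in any read. Those few reads are still enough to reveal that the organism is present.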

High-Throughput Sequencing

The first metagenomic studies conducted using high-throughput sequencing used massively parallel 454 pyrosequencing. Two other technologies commonly applied to environmental sampling are the Illumina Genome Analyzer II and the Applied Biosystems SOLiD system. These techniques generate shorter reads than Sanger sequencing: 454 pyrosequencing typically produces ~400 bp reads, while Illumina and SOLiD produce 25–75 bp reads. These read lengths are significantly shorter than the typical Sanger sequencing read length of ~750 bp.

However, this limitation is compensated for by the much larger number of sequence reads. Pyrosequenced metagenomes generate 200–500 megabases, while Illumina platforms generate around 20–50 gigabases. An additional advantage of short-read sequencing is that it does not require cloning the DNA before sequencing, which removes one of the main biases in environmental sampling. As most short-read assembly software was not designed for metagenomic applications, specialized methods have been developed to utilize mate-read data in metagenomic assembly. From these studies, the microbes that might reside in a sample of soil, or even on the surface of a keyboard, can be identified more accurately and efficiently.
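
A rough calculation shows how the larger output offsets the shorter reads. The sketch below divides the total output by the read length to estimate the number of reads per run. Pairing the upper output figure with the longest read length for each platform is an assumption made for illustration, and `approx_read_count` is a hypothetical helper.

```python
def approx_read_count(total_bases, read_length):
    """Approximate number of reads needed to produce total_bases."""
    return total_bases // read_length

# Upper figures quoted above: 500 megabases of ~400 bp reads for
# pyrosequencing, 50 gigabases of 75 bp reads for Illumina.
pyro_reads = approx_read_count(500_000_000, 400)
illumina_reads = approx_read_count(50_000_000_000, 75)

print(f"454 pyrosequencing: ~{pyro_reads:,} reads")
print(f"Illumina: ~{illumina_reads:,} reads")
```

With these figures a pyrosequencing run yields on the order of a million reads, while an Illumina run yields hundreds of millions.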

Reporter Fusions

A reporter fusion is the hybrid of a gene or portion of a gene with a tractable marker.

Learning Objectives

Explain reporter fusions

Key Takeaways

Key Points

  • A reporter construct allows the study of a gene's function and the localization of its gene product.
  • Promoter-reporter constructs allow a reporter protein to be expressed under the control of a target gene's regulatory sequence.
  • Reporter fusions can fuse a protein of interest to a protein with a property of interest, therefore allowing the tagged protein to be further studied.

Key Terms

  • substrate analog: Substrate analogs (substrate state analogues) are chemical compounds whose chemical structure resembles that of the substrate molecule in an enzyme-catalyzed chemical reaction.
  • luminescent: Emitting light by luminescence.

In molecular biology, a reporter gene (often simply reporter) is a gene that researchers attach to a regulatory sequence of another gene of interest in bacteria, cell culture, animals, or plants. Certain genes are chosen as reporters because the characteristics they confer on organisms expressing them are easily identified and measured, or because they are selectable markers. Reporter genes are often used as an indication of whether a certain gene has been taken up by or expressed in the cell or organism population.

To introduce a reporter gene into an organism, scientists place the reporter gene and the gene of interest in the same DNA construct to be inserted into the cell or organism. For bacteria or prokaryotic cells in culture, this is usually in the form of a circular DNA molecule called a plasmid. It is important to use a reporter gene that is not natively expressed in the cell or organism under study, since the expression of the reporter is being used as a marker for successful uptake of the gene of interest. Commonly used reporter genes that induce visually identifiable characteristics usually encode fluorescent or luminescent proteins. Examples include the gene that encodes jellyfish green fluorescent protein (GFP), which causes cells that express it to glow green under blue light; the gene for the enzyme luciferase, which catalyzes a reaction with luciferin to produce light; and the gene dsRed, which encodes a red fluorescent protein. A common reporter in bacteria is the E. coli lacZ gene, which encodes the protein beta-galactosidase. This enzyme causes bacteria expressing the gene to appear blue when grown on a medium that contains the substrate analog X-gal. An example of a selectable marker that is also a reporter in bacteria is the chloramphenicol acetyltransferase (CAT) gene, which confers resistance to the antibiotic chloramphenicol.

Reporter genes can also be used to assay for the expression of the gene of interest, which may produce a protein that has little obvious or immediate effect on the cell culture or organism. In these cases the reporter is directly attached to the gene of interest to create a gene fusion. The two genes are under the same promoter elements and are transcribed into a single messenger RNA molecule. The mRNA is then translated into protein. In these cases it is important that both proteins be able to properly fold into their active conformations and interact with their substrates despite being fused. In building the DNA construct, a segment of DNA coding for a flexible polypeptide linker region is usually included so that the reporter and the gene product will only minimally interfere with one another. This is often done with GFP. The resulting protein-GFP hybrid expressed from the reporter construct consists of the protein of interest attached to GFP. Because GFP fluoresces, one can deduce that the attached protein is located wherever the fluorescence is. This allows a researcher to determine where in a cell a protein is localized.
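
The layout of such a gene fusion can be sketched in a few lines of code. The example assumes a C-terminal fusion, in which the stop codon of the gene of interest is removed so that translation continues through the linker into the reporter. The function `build_fusion` and all three sequences are short, made-up placeholders, not real gene or GFP sequences.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def build_fusion(gene_orf, linker, reporter_orf):
    """Join gene, linker, and reporter into one open reading frame."""
    parts = (("gene", gene_orf), ("linker", linker), ("reporter", reporter_orf))
    for name, seq in parts:
        # Every part must be a whole number of codons, or the reporter
        # would be translated out of frame.
        if len(seq) % 3 != 0:
            raise ValueError(f"{name} length is not a multiple of 3")
    # Remove the gene's own stop codon so translation runs on into the
    # linker and the reporter.
    if gene_orf[-3:] in STOP_CODONS:
        gene_orf = gene_orf[:-3]
    return gene_orf + linker + reporter_orf

# Made-up placeholder sequences (not real genes).
gene = "ATGGCTAAATAA"      # Met-Ala-Lys-stop
linker = "GGTGGTTCT"       # Gly-Gly-Ser, a short flexible linker
reporter = "ATGGTGAGCTAA"  # stand-in for a reporter reading frame

fusion = build_fusion(gene, linker, reporter)
codons = [fusion[i:i + 3] for i in range(0, len(fusion), 3)]
print(fusion)  # ATGGCTAAAGGTGGTTCTATGGTGAGCTAA
```

In the resulting reading frame the only stop codon is the final one, so a single polypeptide containing the protein, the linker, and the reporter would be produced.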

image

Streptococci: Light microscopy view of streptococci, which are non-sporulating lactic acid bacteria.

image

Introducing a reporter gene into a cell: In molecular biology, a reporter gene (often simply reporter) is a gene that researchers attach to a regulatory sequence of another gene of interest in bacteria, cell culture, animals, or plants

image

GFP fusion proteins: A human mesenchymal stem cell. In this cell a microtubule protein is fused to GFP (green), while a histone protein is fused to RFP (red). The localization of each fused protein can be determined from the fluorescence of its reporter.