Decrypting the Human Genome: Next Generation Sequencing – Part II
In the last installment (Part I), we reviewed representative short-read Next Generation Sequencing (NGS) technologies, which typically produce reads of 100 to 600 base pairs (bp). Short-read NGS has provided high-throughput, fast, and low-cost methods for efficiently mapping the whole human genome sequence (WGS). In 2010, the 1000 Genomes Project was released to establish a database of human genetic variation. In just five years, over 2,500 human genomes from 26 different populations were reconstructed.
Sequencing whole human genomes presents many technical challenges. Human genomes contain a large number of long repetitive sequence segments of more than 1,000 bp, which cannot be resolved by short-read instruments. In addition, each individual human genome has been reported to carry 2.7 to 4.1 million variants; for the 3.2 gigabase (Gb) human genome, this amounts to roughly one variant per 800 to 1,200 bases. The long repetitive segments can include structural alterations and gene mutations related to disease, but might not be efficiently and accurately characterized by short-read NGS technologies. To address these challenges, long-read sequencing technologies have been developed and have already provided some astounding applications. For example, in 2014, scientists successfully applied nanopore sequencing technology developed by Oxford Nanopore Technologies (ONT) to monitor the transmission history and evolution of the Ebola virus, essentially in real time, during its outbreak. In this installment, we will discuss two of the main long-read sequencing technologies, (i) the synthetic approach and (ii) the single-molecule sequencing approach, and will review the relevant patents.
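The variant density mentioned above follows from simple division. The short Python sketch below uses only the figures quoted in this article (a genome of about 3.2 billion bases and 2.7 to 4.1 million variants per individual); the function and variable names are illustrative.

```python
# Back-of-the-envelope variant density, using the figures quoted in the article.
GENOME_SIZE_BASES = 3.2e9   # approximate size of the human genome, in bases
VARIANTS_LOW = 2.7e6        # lower reported number of variants per individual
VARIANTS_HIGH = 4.1e6       # upper reported number of variants per individual


def bases_per_variant(genome_size_bases: float, variant_count: float) -> float:
    """Average spacing between variants, in bases."""
    return genome_size_bases / variant_count


sparse = bases_per_variant(GENOME_SIZE_BASES, VARIANTS_LOW)   # fewest variants
dense = bases_per_variant(GENOME_SIZE_BASES, VARIANTS_HIGH)   # most variants
print(f"one variant per {dense:.0f} to {sparse:.0f} bases")
```

In other words, the average spacing is on the order of one variant per 1,000 bases, which is comparable to the length of the repetitive segments that short reads struggle to resolve.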
Synthetic approach based long-read NGS
Synthetic long-read sequencing actually relies on short-read sequencing. Long DNA fragments are partitioned and cut into short DNA segments, and every segment derived from a given long fragment is tagged with the same barcode (a short oligonucleotide of known sequence). After the pool of short DNA segments is sequenced by short-read methods, the reads sharing a barcode are reassembled in silico to recover the sequence of the original long DNA fragment. Two representative sequencers using the synthetic approach are Illumina’s synthetic long-read sequencing platform developed by Moleculo and a sequencer developed by 10X Genomics. For the Illumina system, DNA is fragmented into 8-10 kb pieces and distributed into microwells with ~3,000 molecules in a single well. In each well, the DNA fragments are enzymatically cleaved into ~350 bp segments, and all segments in the same well are marked with the same barcode. The DNA segments are then pooled and sequenced by standard short-read sequencing methods (US 9,249,460). For the 10X Genomics system, the DNA is fragmented into ~100 kb pieces and flowed through a microfluidic device that ideally encapsulates each fragment, together with a unique barcode sequence, in a single micelle. Once encapsulated within the micelles, the DNA pieces are further cleaved into shorter segments, tagged with barcode sequences and amplified for a subsequent sequencing step (US 9,388,465, US 9,694,361). Unlike Illumina’s technology, the sequence reassembled from a single micelle does not seamlessly cover the entire original DNA fragment. Therefore, the 10X Genomics technology requires sufficient copies of the same DNA fragment to ensure full coverage of the sequence of the original DNA.
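The in silico step described above starts by sorting reads according to their barcodes. The Python sketch below illustrates only that bucketing step under simplifying assumptions: each read is assumed to begin with a fixed-length barcode, the 4-base barcodes and the reads are made-up toy data, and the function name is hypothetical. It does not reflect the actual Illumina or 10X Genomics software, and a real pipeline would pass each bucket to an assembler.

```python
from collections import defaultdict


def group_reads_by_barcode(reads, barcode_length):
    """Split each read into (barcode, insert) and bucket the inserts by barcode.

    Assumes, for illustration only, that the barcode is the first
    `barcode_length` bases of every read.
    """
    buckets = defaultdict(list)
    for read in reads:
        barcode, insert = read[:barcode_length], read[barcode_length:]
        buckets[barcode].append(insert)
    return dict(buckets)


# Toy data: hypothetical 4-base barcodes prepended to each short read.
reads = ["AAAACGTACG", "TTTTGGCATT", "AAAATACGGA", "TTTTCATTAG"]
groups = group_reads_by_barcode(reads, barcode_length=4)

for barcode, inserts in groups.items():
    # Each bucket holds the short reads that came from one partition
    # (one well or one micelle) and would be assembled together.
    print(barcode, inserts)
```

Because reads from different partitions never share a bucket, repeats that occur in different long fragments are not mixed during reassembly, which is the point of the barcoding.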
Single-molecule approach based long-read NGS
Single-molecule long-read sequencing reads DNA fragments directly, without needing to amplify them to enhance the signal. This reduces the cost and time of sample preparation and also avoids the biases and errors introduced during amplification.
The first single-molecule sequencer was marketed by Helicos and originated from technology developed by Dr. Stephen Quake’s group at Stanford University (US 7,037,687, US 7,169,560, US 7,220,549, US 7,767,400). Single DNA molecules are attached to a solid support to form single-molecule arrays, which are then sequenced by a synthesis method in which a fluorophore-labeled deoxyribonucleotide triphosphate (dNTP) is incorporated into a DNA strand to provide a fluorescent signal. The fluorescent signal from the tagged DNA strands is collected by total internal reflection fluorescence (TIRF) microscopy to enhance the signal-to-noise ratio. By recording each incorporated nucleotide, the sequence of the DNA strand is mapped out. Direct Genomics further developed the Helicos technology and successfully released a third-generation sequencer, GenoCare, this past July. GenoCare has demonstrated lower error rates than other single-molecule sequencing technologies, namely a deletion rate of 1.25%, a mismatch rate of 1.10%, and an insertion rate of 0.46%. In contrast, other technologies have shown error rates of about 15%. Although the GenoCare system currently reads only about 30 bp, it has the potential to increase its read length to be competitive with other long-read single-molecule sequencing technologies.
The most popular single-molecule platform is the single-molecule real-time (SMRT) sequencer developed by Pacific Biosciences (PacBio). PacBio distributes DNA fragments into a flow cell having thousands of picolitre wells. The DNA fragments bind to the transparent bottom of the well, which is a zero-mode waveguide (ZMW). Fluorophore-labelled dNTPs are added to each well for incorporation into the DNA strand. With each addition of a dNTP, the color and duration of the light emission in each ZMW are recorded by a camera and correlated to the sequence of the DNA fragment. Before the incorporation of a new dNTP, the fluorophore is cleaved from the incorporated dNTP and diffuses away from the ZMW (US 7,960,116, US 8,153,375). The PacBio sequencer is capable of reading single DNA molecules in excess of 50 kb, with an average read length of 10-15 kb. However, the error rate for this long-read technology is about 15%. This drawback might be alleviated by multiple reading passes and high genome coverage.
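The idea that multiple reading passes can offset a high per-read error rate can be illustrated with a simple majority vote. The Python sketch below is a deliberately simplified illustration and is not PacBio’s consensus algorithm: it assumes the passes are already aligned and of equal length, so it handles only mismatches, not insertions or deletions. The reads are made-up toy data and the function name is hypothetical.

```python
from collections import Counter


def majority_consensus(passes):
    """Column-wise majority vote over equal-length, pre-aligned reads.

    Ties are resolved in favor of the base seen first in that column.
    """
    if not passes:
        raise ValueError("need at least one pass")
    length = len(passes[0])
    if any(len(p) != length for p in passes):
        raise ValueError("all passes must have the same length")
    consensus = []
    for column in zip(*passes):
        base, _count = Counter(column).most_common(1)[0]
        consensus.append(base)
    return "".join(consensus)


# Five toy passes over the same molecule; three of them contain one error each.
passes = ["ACGTTGCA", "ACGATGCA", "ACGTTGGA", "TCGTTGCA", "ACGTTGCA"]
print(majority_consensus(passes))  # ACGTTGCA
```

As long as errors fall at different positions in different passes, each one is outvoted, which is why repeated passes and high coverage can yield an accurate consensus from individually noisy reads.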
Oxford Nanopore Technologies (ONT) developed a nanopore sequencer. It is an extremely small and compact (3 cm × 10 cm) USB-based device powered by a personal computer. This portability makes the nanopore sequencer suitable and accessible for rapid clinical use. The nanopore sequencer directly detects the composition of the DNA as the strand passes through a protein pore. While the DNA translocates through the pore, a voltage is applied across the pore, driving an ionic current through it. The DNA partially blocks the pore, and the bases occupying the pore at any moment modulate the current. The shift in current is recorded and converted into a particular k-mer sequence (US 9,447,152). There are more than 1,000 k-mer sequences correlating to all possible shifts of current. Currently, this nanopore sequencer has a higher error rate, about 30%. The accuracy of this technology could conceivably be improved by optimizing the algorithms for interpreting the k-mer library.
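The figure of more than 1,000 k-mers is consistent with k = 5, since four bases give 4^5 = 1,024 possible 5-mers. The article does not state the value of k, so k = 5 is an assumption here. The Python sketch below simply enumerates the k-mers and lists the successive k-mers that would occupy the pore as a short, made-up strand moves through it one base at a time; it does not model current levels or ONT’s actual base-calling.

```python
from itertools import product


def all_kmers(k, alphabet="ACGT"):
    """Every possible k-mer over the given alphabet."""
    return ["".join(p) for p in product(alphabet, repeat=k)]


def kmers_in_strand(strand, k):
    """Successive overlapping k-mers seen as the strand advances one base at a time."""
    return [strand[i:i + k] for i in range(len(strand) - k + 1)]


K = 5  # assumed k-mer length; not stated in the article
kmers = all_kmers(K)
print(len(kmers))                     # 1024
print(kmers_in_strand("ACGTACG", K))  # ['ACGTA', 'CGTAC', 'GTACG']
```

Because neighbouring k-mers overlap by k - 1 bases, a base caller must decide which of roughly a thousand similar current levels it is seeing at each step, which helps explain why interpretation algorithms matter so much for accuracy.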
Prospects for NGS
NGS provides a low-cost, fast and reliable means of sequencing human genomes. NGS is no longer a novelty and has the potential to become a standard diagnostic and analytical tool in clinical medicine. However, to further advance the application of NGS, there are still challenges to overcome. First, NGS generates vast quantities of data. By 2013, the world was already generating about 15 petabytes of sequencing data annually, and this volume is growing exponentially. This abundance of data requires not only plentiful storage capacity but also sufficient analysis capability to translate the genetic data into interpretable and meaningful biological information. Second, for clinical applications, it is critical to complete sample sequencing and data analysis within several days, or even hours, especially in severe medical situations. Although NGS systems can finish sequencing in hours, in reality, sample preparation and data analysis still take a significant amount of time. Therefore, there is still plenty of opportunity to further improve current NGS technologies. We are entering an astonishing era for NGS and expect rapid technological advancement in this field. Consequently, increased demand for intellectual property protection for these new developments is also anticipated.
| Patent No. | Title | Assignee | Inventors |
|---|---|---|---|
| US 9,249,460 | Methods for obtaining a sequence | The Board of Trustees of the Leland Stanford Junior University | Dmitry Pushkarev; Stephen R. Quake; Ayelet Voskoboynik; Michael Kertesz |
| US 9,388,465 | Polynucleotide barcode generation | 10X Genomics, Inc. | Benjamin Hindson; Mirna Jarosz; Paul Hardenbol; Michael Schnall-Levin; Kevin Ness; Serge Saxonov |
| US 9,694,361 | Fluidic devices, systems, and methods for encapsulating and partitioning reagents, and applications of same | 10X Genomics, Inc. | Rajiv Bharadwaj; Kevin Ness; Debkishore Mitra; Donald Masquelier; Anthony Makarewicz; Christopher Hindson; Benjamin Hindson; Serge Saxonov |
| US 7,960,116 | Nucleic acid sequencing methods and systems | Pacific Biosciences of California, Inc. | John Eid; Alex Dewinter |
| US 8,153,375 | Compositions and methods for nucleic acid sequencing | Pacific Biosciences of California, Inc. | Kevin Travers; Geoff Otto; Stephen Turner; Cheryl Heiner; Congcong Ma |
| US 9,447,152 | Base-detecting pore | Oxford Nanopore Technologies Limited | James Anthony Clarke; Lakmal Jayasinghe; Terence Reid; John Hagan Pryce Bayley |
| US 7,037,687 | Method of determining the nucleotide sequence of oligonucleotides and DNA molecules | Arizona Board of Regents; University of Alberta | Peter Williams; Mark A. Hayes; Seth D. Rose; Linda B. Bloom; Linda J. Reha-Krantz; Vincent B. Pizziconi |
| US 7,169,560 | Short cycle methods for sequencing polynucleotides | Helicos Biosciences Corporation | Stanley N. Lapidus; Philip Richard Buzby; Timothy Harris |
| US 7,220,549 | Stabilizing a nucleic acid for nucleic acid sequencing | Helicos Biosciences Corporation | Philip Richard Buzby |
| US 7,767,400 | Paired-end reads in sequencing by synthesis | Helicos Biosciences Corporation | Timothy D. Harris |
This article is for informational purposes, is not intended to constitute legal advice, and may be considered advertising under applicable state laws. The opinions expressed in this article are those of the author only and are not necessarily shared by Dilworth IP, its other attorneys, agents, or staff, or its clients.