Next Generation Sequencing
After the human genome was painstakingly sequenced using capillary electrophoresis, it was obvious that incredible scientific potential was present within the genetic material of every cell, but that the state of technology was limiting what it was possible to study. The sequencing strategies that have been developed since then are often called “next generation sequencing”, or NGS, and are some of the most important contributors to the explosion of our understanding of genomics, transcriptomics, metagenomics, epigenetics, and more (as well as to the massive growth of the bioinformatics field, just to handle all the incoming data!). Here, I’ve spotlighted three of the primary platforms with a brief summary of their technological approach, optimal scientific applications, and primary disadvantages.
With sequencers ranging from the ultra-portable MinION to the population-scale PromethION, Oxford Nanopore is responsible for a uniquely bio-engineered approach to sequencing that uses ion channel proteins to isolate and evaluate individual strands of DNA or RNA. Each protein is referred to as a nanopore and is embedded in a thin layer of material; the nucleotide sequence is determined by the current change across the pore as it lets each successive nucleotide of the strand through to the other side of the material barrier.
This strategy allows sequencing of DNA or RNA of all different lengths, from small RNAs all the way up to entire genomes (in all seriousness, while the average long Nanopore read is about 23 kb, the record is 2.273 Mb – larger than many bacterial genomes). This read length capacity has some obvious implications for the application of Nanopore technology! Full-length transcriptomic reads can enable the identification of novel isoforms and alternative splicing junctions, while long genomic DNA reads can facilitate de novo genome assembly and identify complex structural variants and genomic rearrangements seen in certain cancers and genetic diseases.
Another advantage of Nanopore sequencing is its ability to sequence DNA and RNA fragments directly, without extensive barcoding, amplification, or other preparation steps other than adapter ligation. This allows epigenetic modifications such as methylation to be analyzed with less cost and without the need for secondary labeling steps that could influence which markers are detected.
Finally, the development of the extremely portable MinION allows researchers to bring NGS technology directly to isolated environments (such as the space station) and urgent situations (such as epidemic outbreaks). It can enable quick on-site metagenomic evaluation of water samples, assist with pathogen identification in patients with unknown disease (especially in remote areas), and serve as an exciting science education tool in schools or at community events.
The primary drawback to Nanopore technology is its relatively high error rate, around 10% on average across a read. Because the current change as each nucleotide passes through the nanopore is influenced by the nearest 3-5 nucleotides, as opposed to a single nucleotide, evaluation of each base call can be challenging. Additionally, the timing of individual current changes can be inconsistent; a change with short duration can be missed, leading to a deletion error, or the number of bases in a single-nucleotide repeat can be misinterpreted. This can make it a poor choice for SNP detection or for strain-level metagenomics, as the level of biological difference may be on par with the error rate and therefore indistinguishable from it.
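To see why single-nucleotide repeats are hard to count, here is a deliberately simplified sketch. It assumes the signal depends only on the 5 nucleotides currently in the pore (the upper end of the 3-5 range above) and that the signal "changes" only when that 5-mer changes. The function names are hypothetical, and real basecallers work on raw current traces with far more sophisticated models.

```python
def kmer_windows(sequence, k=5):
    """Return the successive k-mers that occupy the pore, one per step."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def count_signal_changes(sequence, k=5):
    """Count steps where the k-mer in the pore differs from the previous one.

    Assumption: identical consecutive k-mers produce no visible change
    in current, so only the dwell time hints at how many bases passed.
    """
    windows = kmer_windows(sequence, k)
    return sum(1 for prev, cur in zip(windows, windows[1:]) if prev != cur)


# Two strands that differ only in the length of an A repeat (8 vs. 10).
short_repeat = "ACGT" + "A" * 8 + "CGTA"
long_repeat = "ACGT" + "A" * 10 + "CGTA"

print(count_signal_changes(short_repeat))  # 8
print(count_signal_changes(long_repeat))   # 8
```

In this toy model both strands produce the same number of signal changes, so the only clue to the repeat length is how long the signal stays flat, and that timing is exactly what the paragraph above describes as inconsistent.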
The other sequencing technology capable of very long reads is PacBio. While PacBio flowcells also rely on nanostructures that each sequence a single DNA molecule at a time, they employ wells in a thin metallic film (zero-mode waveguides, or ZMWs) rather than protein pores in a membrane. During sequencing, the DNA molecule and a DNA polymerase are attached to the base of the well and each base is detected as the polymerase copies it through incorporation of fluorescently-tagged nucleotides. Originally, PacBio sequencing struggled with the same accuracy issues as Oxford Nanopore; sequencing a single molecule at a time means that there are no other copies of the signal to average against the noise and help exclude errors, so the raw data had an error rate of 13-15%. However, recent developments in both the sequencing strategy and downstream processing have substantially increased the accuracy, with reports of 99.8% average accuracy per read.
To obtain this accuracy, the long double-stranded DNA fragment to be sequenced (cDNA, in the case of RNA samples) is circularized by adapter ligation and then sequenced repeatedly around that circle in a single ZMW. In processing, the adapter sequences are removed, the remaining sequences are aligned to each other, and the consensus sequence is retained – providing an improved signal-to-noise ratio and making it possible to eliminate random sequencing errors made in only one or two of the repetitions.
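The consensus step can be illustrated with a minimal sketch. It assumes the passes around the circle have already been trimmed of adapters and aligned to equal length, with "-" marking a gap, and it uses a simple per-position majority vote. This is a stand-in for illustration only; the function name is hypothetical and PacBio's actual consensus algorithm is more involved.

```python
from collections import Counter


def majority_consensus(aligned_passes):
    """Build a consensus from pre-aligned passes by per-column majority vote.

    Assumption: every pass has the same length and uses '-' for gaps.
    Columns where the majority call is a gap are dropped from the output.
    """
    if not aligned_passes:
        raise ValueError("need at least one pass")
    length = len(aligned_passes[0])
    if any(len(p) != length for p in aligned_passes):
        raise ValueError("passes must be aligned to the same length")

    consensus = []
    for column in zip(*aligned_passes):
        base, _count = Counter(column).most_common(1)[0]
        if base != "-":
            consensus.append(base)
    return "".join(consensus)


# Five passes over the same molecule; two contain a random error.
passes = [
    "ACGTTAGC",
    "ACGTTAGC",
    "ACCTTAGC",  # substitution at position 3
    "ACGT-AGC",  # deletion at position 5
    "ACGTTAGC",
]

print(majority_consensus(passes))  # ACGTTAGC
```

Because each error appears in only one of the five passes, it is outvoted at its position and disappears from the consensus.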
PacBio sequencing is used for largely the same applications as Oxford Nanopore sequencing – it is an ideal strategy for handling de novo genome and transcriptome assembly thanks to its long and accurate reads, and is positioned to deliver reliable epigenetic data since it directly sequences single molecules and can capture those modifications. Unlike Oxford Nanopore, PacBio sequencing has the accuracy needed for SNP calling (and is even able to identify haplotype-specific variants!) and metagenomics.
The remaining disadvantage of PacBio sequencing is its increased cost (for individual library preparations as well as per GB of data) compared to both Oxford Nanopore and Illumina sequencing. Long-read sequencing in general also has fewer bioinformatics tools available than does short-read sequencing, and the tools that do exist may not be as well-adapted for the non-bioinformatician.
The final technology to discuss here is probably the most frequently used form of next-generation sequencing. Unlike Oxford Nanopore and PacBio, Illumina sequencing employs a short-read strategy, where the maximum read length is about 600 bp and massive numbers of these short fragments must be re-assembled to determine the source sequence. I’ve often compared it to deciphering the text of a novel using millions of overlapping phrases no more than a few words long – and often beginning or ending in the middle of a word!
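The novel analogy can be made concrete with a toy sketch. It assumes the fragments are error-free and already in the right order, and it simply joins each one to the growing text by its longest overlap. Real assemblers must also work out the order, tolerate errors, and cope with repeated phrases; the function names here are hypothetical.

```python
def longest_overlap(left, right, min_overlap=3):
    """Length of the longest suffix of `left` that is a prefix of `right`."""
    max_len = min(len(left), len(right))
    for size in range(max_len, min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def merge_in_order(fragments, min_overlap=3):
    """Join ordered, error-free fragments by their overlaps (toy example)."""
    assembled = fragments[0]
    for fragment in fragments[1:]:
        size = longest_overlap(assembled, fragment, min_overlap)
        if size == 0:
            raise ValueError(f"no overlap found for fragment {fragment!r}")
        assembled += fragment[size:]
    return assembled


# Overlapping "reads" that begin and end in the middle of words.
fragments = ["It was th", "as the be", "he best of", "st of times"]

print(merge_in_order(fragments))  # It was the best of times
```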
During sequencing, millions of these fragments bind randomly to a lawn of oligos on the interior surface of the flowcell, are amplified to create clusters of many identical fragments representing an individual source template, and finally are sequenced one nucleotide at a time by alternately incorporating and removing fluorescently-tagged bases during synthesis.
Illumina sequencing has the highest accuracy of the three main NGS platforms, with an error rate of 0.1-0.6% on average depending on the specific instrument used, making it the optimal choice for SNP calling, high-resolution microbial identification, and clinical applications, where a change of just 1-2 bases can be significant. Illumina sequencing also has a relatively low entry cost and well-established bioinformatic pipelines, particularly for the commonly-used applications such as 16S/18S/ITS microbiome sequencing, small variant detection with reference genomes, and differential expression of genes with reference transcriptomes.
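To put the error rates quoted in this post on a common footing, the sketch below converts them to Phred quality scores (Q = -10 · log10 of the error probability, a standard convention that is not otherwise used in this post) and estimates the expected number of errors in a single read. The read lengths are the 23 kb average Nanopore read and the roughly 600 bp Illumina maximum mentioned earlier; treating the error rate as uniform along a read is a simplifying assumption.

```python
import math


def error_rate_to_phred(error_rate):
    """Convert a per-base error probability to a Phred quality score."""
    if not 0 < error_rate <= 1:
        raise ValueError("error_rate must be in (0, 1]")
    return -10 * math.log10(error_rate)


def expected_errors(error_rate, read_length):
    """Expected number of errors in one read, assuming a uniform error rate."""
    return error_rate * read_length


# Error rates as quoted in this post (PacBio: 99.8% accuracy = 0.2% error).
platform_error_rates = {
    "Nanopore (about 10%)": 0.10,
    "PacBio consensus (0.2%)": 0.002,
    "Illumina, low end (0.1%)": 0.001,
    "Illumina, high end (0.6%)": 0.006,
}

for name, rate in platform_error_rates.items():
    print(f"{name}: Q{error_rate_to_phred(rate):.1f}")

print(f"Errors in a 23 kb Nanopore read: {expected_errors(0.10, 23_000):.0f}")
print(f"Errors in a 600 bp Illumina read: {expected_errors(0.001, 600):.1f}")
```

An error rate of 0.1% corresponds to Q30 and 10% to Q10, which shows why single-base differences are much easier to trust in Illumina data than in raw Nanopore reads.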
On the other hand, while it is possible to attempt genome assembly and epigenetic work with short-read sequencing, it is challenging to sequence and assemble high-repeat or centromeric regions with this strategy. Whole genome metagenomics can also benefit from the increased assembly resolution provided by long reads, although most bioinformatic tools are currently still intended for short-read data. The fast-growing world of single-cell sequencing also uses the Illumina platform extensively at this point, since it typically involves model organisms where alignment to a reference, rather than de novo assembly, is required.
The best platform for your research depends on your experimental aims – and even once you’ve chosen a platform, there are other more detailed decisions to make within those parameters (such as read depth and replication). Through our partnerships around the state of Arizona, our core can provide streamlined access to all three of these NGS platforms as well as assistance with experimental design, library preparation, and data analysis. Please contact us if you have any questions!
For More Information:
Wang Y, Zhao Y, Bollas A et al. Nanopore sequencing technology, bioinformatics and applications. Nat Biotechnol 39, 1348–1365 (2021). https://doi.org/10.1038/s41587-021-01108-x (shareable link: https://rdcu.be/cXD00)
Deamer D, Akeson M & Branton D. Three decades of nanopore sequencing. Nat Biotechnol 34, 518–524 (2016). https://doi.org/10.1038/nbt.3423 (shareable link: https://rdcu.be/cXD0Z)
Weirather JL, de Cesare M, Wang Y et al. Comprehensive comparison of Pacific Biosciences and Oxford Nanopore Technologies and their applications to transcriptome analysis. F1000Res 6, 100 (2017). https://doi.org/10.12688/f1000research.10571.2
Wenger AM, Peluso P, Rowell WJ et al. Accurate circular consensus long-read sequencing improves variant detection and assembly of a human genome. Nat Biotechnol 37, 1155–1162 (2019). https://doi.org/10.1038/s41587-019-0217-9 (shareable link: https://rdcu.be/cXGGL)
Stoler N, Nekrutenko A. Sequencing error profiles of Illumina sequencing instruments. NAR Genomics and Bioinformatics, Volume 3, Issue 1, March 2021, https://doi.org/10.1093/nargab/lqab019
Bentley D, Balasubramanian S, Swerdlow H et al. Accurate whole human genome sequencing using reversible terminator chemistry. Nature 456, 53–59 (2008). https://doi.org/10.1038/nature07517 (shareable link: https://rdcu.be/cXGZt)
All images are used courtesy of Oxford Nanopore, PacBio, and Illumina and are obtained from their promotional and/or educational resources.