ABSTRACT
Transcriptome profiling has become routine in studies of many biological processes. However, the favored approaches such as short-read Illumina RNA sequencing are giving way to long-read sequencing platforms better suited to interrogating the complex transcriptomes typical of many RNA and DNA viruses. Here, we provide a guide—tailored to molecular virologists—to the ins and outs of viral transcriptome sequencing and discuss the strengths and weaknesses of the major RNA sequencing technologies as tools to analyze the abundance and diversity of the viral transcripts made during infection.
INTRODUCTION
When embarking on any experimental study, it is vital to carefully frame the question(s) being asked and to understand the exact nature of the information that different methodologies and approaches provide. This is especially relevant when profiling viral transcriptomes by using next-generation sequencing (NGS). Careful planning pays dividends, and four key decisions should be made at the outset. The first is whether the primary focus of the study is on viral and/or host transcripts. The second is the choice of viral strain and host model (in vivo, ex vivo, in vitro), both of which can have huge impacts on data output and the resulting biological observations and interpretations. The third is whether the goal is to document the diversity of RNAs present (define the transcript isoform landscape) or to quantify the relative abundance of specific transcripts (perform gene expression profiling), often a surrogate for the more difficult task of profiling protein products. The final decision is whether to incorporate multiple infection or reactivation time points (i.e., to profile early or late stages of infection) and to consider which time points may be optimal for a given experimental system. Qualitative measures might take the form of mapping transcription start site (TSS) usage or detecting posttranscriptional processing events such as alternative splicing or alternative polyadenylation. The activity of the virus itself can confound the interpretation of transcriptome data through the modifications of transcriptional and posttranscriptional processes (1). Examples include acute infection by orthomyxoviruses, poxviruses, coronaviruses, and picornaviruses, as well as many herpesviruses that either degrade or transcriptionally suppress host and viral mRNAs (2–10). One consequence is that the templates used for sequencing reactions are no longer translation competent, undermining the utility of RNA sequencing (RNA-Seq) for predicting the proteome. 
Elegant solutions abound but remain underutilized, due mainly to the issues of cost and technical complexity. These include sequencing only mRNAs loaded onto ribosomes (11, 12) or the use of specific adaptors to generate sequencing libraries limited to full-length polyadenylated RNAs (13). Lastly, it is important to keep in mind that technical and analytical requirements for transcriptome studies are different from those of genomic studies that might characterize new viruses, detect low frequency variants, or trace the origins and consequences of sequence diversity within populations (14).
In this Gem, we highlight the major challenges that can arise when studying viral transcriptomes with the leading RNA-Seq technologies. The limitations are most evident for viruses with large gene-dense double-stranded DNA genomes such as herpesviruses, poxviruses, and adenoviruses, where transcription often takes place on opposing DNA strands, where protein-coding open reading frames (ORFs) are organized as mono- or polycistronic transcription units, and where further complexity arises through the use of alternative 5′ and 3′ ends or internal splicing. In general, the methods to generate and interpret RNA-Seq data were not designed for genomes with such complexity and are better suited to analyze host transcripts. Heterogeneity within infections, resulting either from the asynchronous onset of viral gene expression (15) or the mix of infected and uninfected cells, can make the analysis even more difficult. Depending on the virus and the conditions of the infection, the relative proportions of viral and host transcripts can vary significantly, and this will influence the depth of sequencing required to achieve a robust signal for viral RNAs, a parameter that needs to be considered from the outset. Readers should be aware that while we have endeavored to cite the most pertinent peer-reviewed studies, the cutting-edge nature of several methodologies presented necessitates the citation of preprint publications. Over time, we anticipate all of these will be published and encourage readers to seek out the final peer-reviewed versions as they become available.
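As a rough illustration of the depth consideration above, the total number of reads required scales inversely with the fraction of reads that derive from the virus. The following Python sketch assumes that this fraction has already been estimated (e.g., from a pilot run); the function name and the example values are hypothetical and are not drawn from any of the cited studies.

```python
import math


def total_reads_needed(target_viral_reads: int, viral_fraction: float) -> int:
    """Total reads to sequence so that the expected number of viral reads
    reaches the target, given the (assumed) fraction of reads that are viral."""
    if not 0 < viral_fraction <= 1:
        raise ValueError("viral_fraction must be greater than 0 and at most 1")
    return math.ceil(target_viral_reads / viral_fraction)


# Hypothetical scenario: one read in four is viral and 1,000 viral reads are wanted.
print(total_reads_needed(1_000, 0.25))
```

The estimate is an expectation only; it ignores sampling variation and any reads lost during quality filtering or deduplication.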
OPTIMIZING SHORT-READ SEQUENCING APPROACHES FOR VIRAL TRANSCRIPTOMES
The majority of short-read RNA-Seq studies document host responses to different perturbations or compare different cell types and tissues, each approach requiring a careful consideration of the experimental design to avoid batch effects and other confounding influences that are discussed in detail elsewhere (16, 17). Standard short-read RNA-Seq pipelines generate tens of millions of paired-end reads that are then aligned to the host genome and/or transcriptome. There are numerous variations of this general approach, designed to answer more specific questions such as the mapping of sites of transcript initiation (cap analysis gene expression sequencing [CAGE-Seq] [18]) or the placement of modified bases (N6-methyladenine sequencing [m6A-Seq] [19, 20]). While these variants are not dealt with in detail here (for a full review, see reference 21), many of the principles discussed will still apply.
Standard RNA-Seq is comparatively simple in terms of sample preparation and data analysis. Depending on the needs of the experiment, polyadenylated RNA (ostensibly, mRNA) is isolated from total RNA and used to construct sequencing libraries (Fig. 1A). Alternatively, the highly abundant ribosomal RNAs are removed and the remaining RNA is used for the library (Fig. 1B). The choice between these strategies is dictated by whether nonpolyadenylated host and/or viral RNAs are of interest in the study. With either option, the retained RNA is fragmented and used as a template to synthesize first- and second-strand cDNA, followed by end repair, the ligation of Illumina-compatible adaptors, and the indexing of individual samples to enable multiplexed sequencing on Illumina NextSeq, HiSeq, or NovaSeq platforms. The resulting sequence reads are aligned (22) to well-annotated genomes and processed to generate expression counts, a measure of the relative abundance of the corresponding transcript. For instance, transcripts per million (TPM) is used to specify the relative frequency of a given transcript in a population. The generation of the expression counts requires that genome annotations specify the limits of individual transcribed mRNAs, the coding sequence (CDS) within, and known splicing patterns. This is crucial to ensure reads are correctly assigned to a given transcription unit (23, 24), a step that becomes significantly more complicated where transcription units overlap. Annotations are available for the genomes of humans and the major model organisms but, more often than not, are missing or grossly oversimplified for viruses. As a result, alternative transcript structures can easily go undetected, and the presence of overlapping transcription units, which cannot be distinguished readily, can seriously confound expression level estimates and impair subsequent interpretations of the underlying biology. 
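To make the TPM measure concrete, the following Python sketch computes it from per-transcript read counts and transcript lengths. It assumes that reads have already been assigned to transcripts and uses the annotated transcript length rather than an effective length; the function and variable names are illustrative only.

```python
def tpm(counts: dict, lengths_bp: dict) -> dict:
    """Transcripts per million: normalize each count by transcript length,
    then rescale so that the values sum to one million."""
    rates = {t: counts[t] / lengths_bp[t] for t in counts}
    total = sum(rates.values())
    if total == 0:
        return {t: 0.0 for t in counts}
    return {t: rate / total * 1e6 for t, rate in rates.items()}


# Hypothetical example: equal read counts, but transcript B is twice as long as A.
example = tpm({"A": 100, "B": 100}, {"A": 1000, "B": 2000})
print(example)
```

Because the calculation depends on both the read assignment and the annotated length, an incomplete or oversimplified viral annotation distorts the resulting values directly.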
A partial solution is to increase the sequence read length by altering the RNA fragmentation step (using a reduced temperature and fragmentation time) and by increasing the number of cycles in the sequencing reaction (e.g., 200 to 300 cycles instead of the usual 75). Longer sequence reads can significantly improve the detection of splice site usage and the discrimination of transcripts originating from overlapping genes (25). Where viral transcripts are less abundant than those of the host, for instance, in clinical samples or where a low multiplicity of infection is involved, a targeted enrichment of viral nucleic acids offers an invaluable tool that can be integrated into standard RNA-Seq workflows (26). Prepared sequencing libraries are hybridized to a set of short (40- to 150-mer) overlapping biotinylated RNA or DNA probes complementary to the viral genome and isolated using streptavidin-coated magnetic beads (Fig. 1A). An important drawback is the requirement for extensive amplification by PCR to generate sufficient material for sequencing, and for this reason, it is crucial to accurately deduplicate mapped sequence reads following an alignment against a reference genome/transcriptome. Improvements in short-read RNA-Seq combined with targeted enrichment can lead to exciting biological findings, such as the recent discovery of a varicella-zoster virus (VZV) latency-associated transcript, which had otherwise eluded conventional approaches to transcript identification (27).
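The deduplication step mentioned above can be sketched in Python as follows. This is a minimal, position-based illustration and not the method of any particular tool: it assumes that paired-end fragments have already been reduced to a reference name, start, end, and strand, treats fragments with identical values as PCR duplicates, and keeps the one with the highest mapping quality. The `Fragment` type and its fields are hypothetical, and dedicated software should be preferred in practice.

```python
from typing import Iterable, List, NamedTuple


class Fragment(NamedTuple):
    ref: str      # reference sequence name
    start: int    # leftmost aligned position of the fragment
    end: int      # rightmost aligned position of the fragment
    strand: str   # "+" or "-"
    mapq: int     # mapping quality


def deduplicate(fragments: Iterable[Fragment]) -> List[Fragment]:
    """Keep one fragment per (ref, start, end, strand), preferring higher MAPQ."""
    best = {}
    for frag in fragments:
        key = (frag.ref, frag.start, frag.end, frag.strand)
        if key not in best or frag.mapq > best[key].mapq:
            best[key] = frag
    return list(best.values())


# Hypothetical input: the first two fragments share coordinates and strand.
fragments = [
    Fragment("virus", 100, 250, "+", 60),
    Fragment("virus", 100, 250, "+", 30),
    Fragment("virus", 100, 260, "+", 60),
]
print(len(deduplicate(fragments)))
```

For small viral genomes sequenced at a high depth, independent fragments can share coordinates by chance, so purely position-based deduplication may remove genuine signal; this is one reason why molecular indexes are attractive where available.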
FIG 1 Comparison of major RNA sequencing methodologies. Viral transcriptome profiling by NGS can be performed using either short-read (Illumina) or long-read (PacBio and Nanopore) platforms. RNA-Seq of polyadenylated RNAs (A) or total RNA after rRNA depletion (B) enables profiling at a high resolution but requires a well-curated reference genome for analysis. (C) Single-cell RNA sequencing (scRNA-Seq) enables the profiling of gene expression within tens to thousands of individual cells, but sequencing is limited to the extreme 3′ end of each RNA and is less effective if viral transcript levels are low relative to host RNAs. Long-read sequencing can be achieved using either cDNA or RNA. (D) cDNA sequencing on the PacBio platform enables full-length sequencing from 5′ cap to the 3′ RNA cleavage site but includes amplification and size selection steps that can bias outputs. (E) Nanopore arrays can be utilized to directly sequence individual polyadenylated RNAs from the 3′ polyadenylated tail toward the 5′ cap and can potentially map RNA modifications and estimate poly(A) tail lengths. High error rates require dedicated correction and analysis pipelines. (F) The choice of sequencing platform is dictated by the required depth of sequencing, as these differ markedly in the numbers and lengths of reads generated. A schematic transcriptome plot denotes how different methodologies may impact data interpretation. Here, polyadenylated transcripts (blue bars) are consistently represented regardless of protocol choice, while nonadenylated transcripts (gray bars) are underrepresented/absent when using protocols that incorporate poly(A) selection or a cDNA priming step using poly(T) adaptors. Broad indications of the general advantages (+) and disadvantages (−) of each major sequencing protocol/platform are indicated. Recoding refers to the inclusion of steps involving reverse transcription or amplification by a thermostable DNA polymerase.
SINGLE-CELL RNA SEQUENCING
It has long been understood that infection by genetically identical bacteria or viruses can give rise to different outcomes, presumably reflecting differences in the responses of individual host cell genomes. The underlying assumption in all “bulk” RNA sequencing is that cellular populations are near homogenous or at least dominated by a specific cell type so that imputed changes in host expression can be meaningfully interpreted. The reality remains, however, that many experimental systems contain heterogeneous cell populations, and the ability to dissect host and viral transcriptomes in each of these offers a powerful new approach for understanding the biology of virus-host interactions. For instance, with a mixed population of neurons, single-cell RNA sequencing (scRNA-Seq) (Fig. 1C) theoretically enables users not only to classify discrete neuronal subpopulations (28) but also to examine whether viral infections (as measured by transcription) are restricted to specific neuronal subtypes and to identify unique markers of these subtypes while also exploring how the host transcriptome reacts to the presence of the virus.
The simultaneous interrogation of host and microbial transcriptomes within a single cell is fast becoming a reality, albeit tempered by some critical limitations (29). scRNA-Seq protocols vary in their throughput and sensitivity (reviewed in detail in reference 30). In general, the approach requires the sorting of dissociated cells into individual wells on a plate or chip or into individual oil droplets, where they can be mixed with barcoded primers that enable the conversion into cDNA. Each primer sequence incorporates a cell-specific barcode so that all subsequent reads from that one cell can be analyzed together (31). This approach frequently incorporates the use of unique molecular index (UMI) sequences, which enable the exclusion of duplicated sequence reads arising during PCR amplification from later analyses (32). Following lysis and first-strand synthesis, the samples are pooled and the final sequencing libraries are constructed. Paired-end sequencing results in one sequence read containing the cellular barcode (and UMI) and the other containing a short span of sequence mapping to the 3′ end of a given mRNA. Sequence alignment is often performed using STAR (22) or Kallisto (33), either as part of commercial (e.g., 10x Genomics Cell Ranger) or custom pipelines. Subsequent analyses aimed at identifying and stratifying host cell types by the expression of one or more markers (e.g., beta-tubulin III for neuronal lineage cells) and profiling differential expression are readily performed using an ever-growing list of tools, including Seurat (34), Monocle (35), and MAST (36), many of which are accompanied by excellent tutorials.
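The logic of cell barcodes and UMIs described above can be sketched in Python as follows. The sketch assumes that each read has already been parsed into a cell barcode, a UMI, and an assigned gene, and it counts each distinct UMI only once per cell and gene. It does not correct for sequencing errors within barcodes or UMIs, which established pipelines handle; all names and sequences are illustrative.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def umi_counts(reads: Iterable[Tuple[str, str, str]]) -> Dict[Tuple[str, str], int]:
    """Count distinct UMIs for every (cell barcode, gene) pair.

    Each read is a (cell_barcode, umi, gene) tuple. Reads that share all
    three values are treated as PCR duplicates of a single molecule.
    """
    seen = defaultdict(set)
    for cell, umi, gene in reads:
        seen[(cell, gene)].add(umi)
    return {key: len(umis) for key, umis in seen.items()}


# Hypothetical reads: the first two are duplicates of the same molecule.
reads = [
    ("CELL1", "AACG", "viral_gene_1"),
    ("CELL1", "AACG", "viral_gene_1"),
    ("CELL1", "TTGA", "viral_gene_1"),
    ("CELL2", "AACG", "host_gene_1"),
]
print(umi_counts(reads))
```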
However, as with bulk RNA sequencing approaches, the current protocols for scRNA-Seq are optimized toward analyzing host mRNAs. The use of 3′ sequencing remains problematic for many viruses because of incomplete genome annotations, the presence of polycistronic gene arrays, and the presence of nonpolyadenylated RNAs, all of which can lead to valid sequence reads being erroneously removed or misassigned. Moreover, mapping and subsequent analyses of sequence reads generally require merging of the host and viral reference genomes to ensure that viral reads are retained and assigned to the correct host cell. Thus, there is a need to generate alternative viral genome annotations in which polycistronic gene arrays are collapsed into transcription units. Studies of herpesvirus latency or other low-abundance viral infections remain challenging, because viral mRNA abundance may often be below the level of detection in single cells and because viral markers of latency are not necessarily polyadenylated, as is the case for the stable intron derived from the herpes simplex virus (HSV) latency-associated transcript (37). Given the pace at which this field is progressing, many of these problems will likely be overcome and soon (30, 38); however, caution must be exercised in the experimental design and interpretation, with an awareness that off-the-shelf bioinformatics solutions are rarely suited to examining host-virus interactions.
RISKS AND REWARDS OF LONG-READ SEQUENCING
The concept of sequencing full-length RNAs (originally, as expressed sequence tags [ESTs]) can be traced back to the early 1980s (39), and as a technique, this has continued to evolve in step with advancements in sequencing technologies (40). The current iteration of long-read RNA-Seq enables the sequencing of polyadenylated mRNAs from the 3′ poly(A) tail toward the 5′ cap. Although it produces far fewer reads than Illumina sequencing, the generation of long sequence reads obviates the need to computationally stitch together sequence fragments in order to reconstruct the original transcripts. Long-read RNA-Seq has been used to catalog transcript variation through alternative splicing and to identify novel transcripts or transcript isoforms (41, 42). More importantly, when combined with short-read RNA-Seq and/or variant approaches such as CAGE-Seq, it enables fine detailing of viral transcriptomes at a very high resolution (43–45).
Currently, there are two different options for long-read sequencing. Single-molecule real-time (SMRT) sequencing of cDNA using the Pacific Biosciences (PacBio) platform (Fig. 1D) represents the most popular approach but faces stiff competition from nanopore array sequencing (Oxford Nanopore Technologies MinION platform) of either cDNA or, most excitingly, the RNA itself (Fig. 1E) (13). While SMRT sequencing is well established, the relative complexity of constructing the libraries and the physical size of the PacBio sequencer require most users to work with a core facility to generate the data. In contrast, the nanopore MinION has a very small footprint, can be run locally while attached to a standard laptop or desktop computer, and offers simple yet rapid library construction protocols. The sequencing of RNA without the conversion to cDNA is termed direct (dRNA-Seq) or native (nRNA-Seq) RNA sequencing (46).
Both approaches are currently highly constrained by the amount of starting material required (generally >500 ng of polyadenylated RNA) and produce comparatively few reads (less than one million reads per run), which limits the depth of sequencing. Likewise, both suffer from high error rates (47), although these are lower for SMRT sequencing, which also benefits from the dual capture of the 5′ cap and 3′ poly(A) tail, enabling the accurate mapping of transcription start and RNA cleavage sites. These advantages must be offset against a more involved sequencing protocol, which includes reverse transcription and PCR steps and the need to size-select fragments prior to sequencing. In contrast, the direct sequencing of polyadenylated RNA by using the nanopore arrays (MinION) platform combines a simple library preparation protocol (<2 h) with an overnight sequencing run. Here, a sequencing adapter is ligated to the poly(A) tail, enabling first-strand synthesis to produce a stable RNA:cDNA hybrid. A motor protein is then attached to the polyadenylated RNA strand, which is unwound from the cDNA and guided through protein nanopores embedded in a membrane. As each nucleotide is drawn through the pore, it disrupts the current, enabling the sequence to be read. This represents the most unbiased approach to RNA sequencing, as each individual read is generated from an individual polyadenylated RNA, avoiding any amplification steps.
Although a single nanopore MinION run can generate at least double the number of reads obtained by SMRT sequencing, the error rate is, currently, notably higher (47). Error correction, a complex proposition, is first required to accurately map the extreme 5′ and 3′ ends of transcripts and to accurately identify sites of splicing. While aligning reads to a reference genome is relatively simple following the development of minimap2 (48), custom pipelines are often required to identify transcription start and RNA cleavage sites (44). A visual inspection of the data is crucial for identifying novel genes or splice variants, and users should be particularly aware of sequencing artefacts (signal loss/interruption) on nanopore platforms that can masquerade as excised introns. As base-calling and error-correction techniques improve, it seems likely, on the basis of the overall ease of use and affordability, that nanopore sequencing will become the favored long-read sequencing approach for transcriptomic studies. Enhancements such as the ability to estimate poly(A) tail lengths and to identify specific RNA modifications (such as N6-methyladenosine) are fast becoming a reality (49). Likewise, the ability to design custom adapters targeting RNA populations of interest will broaden the sequencing capabilities beyond polyadenylated mRNAs, as evidenced by the recent direct genome sequencing of RNA viruses such as influenza A virus (50).
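One way to prioritize candidate introns for the visual inspection recommended above is to tally how many reads support each gap in the alignments. The following Python sketch assumes SAM-style alignments, in which the CIGAR operation N denotes a region of the reference skipped by the read, and a 1-based leftmost alignment position. The function names are illustrative, and a low read count only marks a gap for closer inspection; it does not prove that the gap is an artefact.

```python
import re
from collections import Counter
from typing import Iterable, List, Tuple

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def skipped_regions(pos: int, cigar: str) -> List[Tuple[int, int]]:
    """Return the reference intervals (1-based, inclusive) covered by N operations."""
    regions = []
    ref = pos
    for length, op in CIGAR_OP.findall(cigar):
        n = int(length)
        if op == "N":
            regions.append((ref, ref + n - 1))
        if op in "MDN=X":  # operations that consume reference bases
            ref += n
    return regions


def junction_support(alignments: Iterable[Tuple[int, str]]) -> Counter:
    """Count the reads supporting each skipped region."""
    support = Counter()
    for pos, cigar in alignments:
        support.update(skipped_regions(pos, cigar))
    return support


# Hypothetical alignments: two reads share one gap, and a third has a different gap.
reads = [(100, "50M200N30M"), (120, "30M200N60M"), (100, "40M15N80M")]
print(junction_support(reads))
```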
As a final point, it should be kept in mind that, due to the comparatively small number of reads generated during each run, long-read sequencing is less useful when viral RNA yields are low. Likewise, the requirement for micrograms of total RNA as the input material will limit the infection models that are compatible with current long-read RNA-Seq methodologies. Whether this will be improved by the application of targeted enrichment approaches remains to be seen, although in the case of nanopore sequencing, this problem might be circumvented by use of higher capacity platforms such as GridION and PromethION. These offer far greater numbers of sequence reads per run, although at the time of writing, neither is compatible with direct RNA sequencing.
ROLE OF THE BIOINFORMATICIAN
Nowadays, many research labs have either direct (integrated into the research group) or indirect (as a collaboration or core facility) access to bioinformaticians who can turn raw sequencing data into lists of regulated genes supported by statistical significance values (P values) corrected for multiple testing. To ensure success, it is crucial to involve these individuals in the planning stages so that subsequent analyses can be tailored to the viral genome of choice and to avoid the pitfalls that come with applying “one size (does not) fit all” approaches. Planning discussions should address issues of reproducibility (biological replicates), batch effects, and the availability and quality of gene annotations for the organism(s) of interest. Establishing and optimizing analytical pipelines using test data sets prior to generating the final experimental data sets can also help to identify and preempt critical issues that might otherwise necessitate an experimental redesign and resequencing, an expensive proposition in both time and money. It is also critical that bioinformaticians move away from standard RNA-Seq analysis pathways when dealing with viruses and, with guidance, become aware of the biological characteristics and genome structure of the virus being studied. For instance, the existence of polycistronic arrays in herpesviruses limits scRNA-Seq gene expression analyses, because all transcripts generated across the polycistronic unit share the same 3′ end. This can significantly impact the alignment of viral reads to a transcriptome, because many scRNA-Seq software packages will by default discard reads that map identically against the 3′ ends of multiple transcripts. Thus, it is necessary to represent polycistronic genes that share the same 3′ ends as a single transcription unit, although this diminishes the yield of biologically relevant information.
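The collapsing of 3′-coterminal transcripts into a single transcription unit can be sketched in Python as follows. The sketch assumes a simplified annotation for a single viral genome in which each transcript is described by a name, its leftmost and rightmost coordinates, and its strand; the `Transcript` type and the example names are hypothetical and do not correspond to any standard annotation format.

```python
from collections import defaultdict
from typing import Iterable, List, NamedTuple


class Transcript(NamedTuple):
    name: str
    start: int   # leftmost coordinate
    end: int     # rightmost coordinate
    strand: str  # "+" or "-"


def three_prime_end(t: Transcript) -> int:
    """The 3' end is the rightmost coordinate on "+" and the leftmost on "-"."""
    return t.end if t.strand == "+" else t.start


def collapse_by_three_prime_end(transcripts: Iterable[Transcript]) -> List[Transcript]:
    """Merge transcripts that share a strand and 3' end into one unit
    spanning the longest member of each group."""
    groups = defaultdict(list)
    for t in transcripts:
        groups[(t.strand, three_prime_end(t))].append(t)
    units = []
    for (strand, _), members in sorted(groups.items()):
        units.append(Transcript(
            name="|".join(sorted(m.name for m in members)),
            start=min(m.start for m in members),
            end=max(m.end for m in members),
            strand=strand,
        ))
    return units


# Hypothetical polycistronic array: geneA and geneB share a 3' end on the "+" strand.
annotation = [
    Transcript("geneA", 100, 1000, "+"),
    Transcript("geneB", 400, 1000, "+"),
    Transcript("geneC", 1200, 2000, "-"),
]
for unit in collapse_by_three_prime_end(annotation):
    print(unit)
```

Joining the member names keeps a record of which genes each unit represents, so that the information lost by collapsing is at least explicit in downstream tables.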
Naturally, it is also critical that the 3′ ends of these transcription units are accurately mapped prior to embarking upon scRNA-Seq projects; otherwise, meaningful biological data will likely end up being discarded. This is a frequent problem with viral annotations specifying only the boundaries of the coding sequences (ORFs) rather than the transcript as a whole. The reads obtained during scRNA-Seq are typically limited to the 3′ untranslated region (UTR) and may not be correctly assigned to a recognized gene. Another potential confounder is that many viral genomes contain duplicated regions, which can result in short or long sequence reads being automatically discarded (no single mapping location) or their distribution distorted (sequence reads not allocated correctly between duplicated units), which can influence resulting TPM counts.
Another important consideration is that while viral genomes exhibit a wide range of sizes, they are orders of magnitude smaller than the genomes of their hosts. This enables the generation of linear or circular genome-wide coverage plots using tools such as the R package Gviz (51) or Circos (52) that provide an easy-to-assimilate overview of transcription patterns across the viral genome. It is reasonably straightforward, and crucial, to examine aligned sequence data by loading full data sets into visualization tools such as IGV, the UCSC genome browser, or Tablet (53–55) or by using R packages such as Gviz (51) and ggbio (56). Using these graphical outputs, the read data can be quickly inspected and analyzed against current gene annotations, as this may reveal areas within the genome that were not previously known to be transcribed or that show evidence of alternative transcript structures. This can provide the first clue to novel coding units or noncoding RNAs that were missed by previous annotators focused on identifying sizeable single-exon open reading frames.
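The effect of duplicated regions on read counts can be illustrated with a short Python sketch that compares two simple policies for reads mapping to more than one feature: discarding them or splitting each read equally between its mapping locations. Both policies are simplifications chosen for illustration and are not the behavior of any specific software package; the function name and the input format are assumptions.

```python
from collections import Counter
from typing import Dict, List


def count_reads(alignments: Dict[str, List[str]], policy: str = "discard") -> Dict[str, float]:
    """Count reads per feature.

    alignments maps a read identifier to the list of features it aligns to.
    policy "discard" ignores multimapped reads; policy "split" gives each of
    the n features 1/n of the read.
    """
    if policy not in ("discard", "split"):
        raise ValueError("policy must be 'discard' or 'split'")
    counts = Counter()
    for features in alignments.values():
        if len(features) == 1:
            counts[features[0]] += 1.0
        elif policy == "split" and features:
            for feature in features:
                counts[feature] += 1.0 / len(features)
    return dict(counts)


# Hypothetical reads: two of three map equally well to both copies of a repeat.
alignments = {
    "read1": ["unique_gene"],
    "read2": ["repeat_copy1", "repeat_copy2"],
    "read3": ["repeat_copy1", "repeat_copy2"],
}
print(count_reads(alignments, "discard"))
print(count_reads(alignments, "split"))
```

Under the first policy, the duplicated genes appear silent; under the second, they appear equally expressed regardless of which copy was actually transcribed. Neither outcome should be interpreted without reference to the genome structure.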
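Because viral genomes are small, per-base coverage across the whole genome can be computed directly and then passed to a plotting tool. The following Python sketch assumes that alignments have been reduced to 0-based, half-open (start, end) intervals on a single strand of a linear genome; it does not account for spliced reads or circular genomes, and the function name is illustrative.

```python
from typing import Iterable, List, Tuple


def coverage(genome_length: int, alignments: Iterable[Tuple[int, int]]) -> List[int]:
    """Per-base read depth from 0-based, half-open alignment intervals."""
    diff = [0] * (genome_length + 1)
    for start, end in alignments:
        if not 0 <= start < end <= genome_length:
            raise ValueError(f"interval ({start}, {end}) lies outside the genome")
        diff[start] += 1   # depth rises where an alignment begins
        diff[end] -= 1     # and falls just past its last base
    depth, running = [], 0
    for change in diff[:genome_length]:
        running += change
        depth.append(running)
    return depth


# Hypothetical 10-base genome with two overlapping alignments.
print(coverage(10, [(0, 5), (3, 8)]))  # [1, 1, 1, 2, 2, 1, 1, 1, 0, 0]
```

Running the function separately on alignments from each strand yields the strand-specific tracks needed to distinguish transcription units on opposing DNA strands.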
WHERE WE ARE AND WHERE WE ARE GOING
The ability to directly sequence full-length RNAs within individual infected cells while retaining spatial and temporal information seems like science fiction. However, the speed at which nanopore and single-cell transcriptome sequencing technologies are developing seems certain to make this a reality, and soon (57). Applying these methodologies to understanding virus-host interactions remains a formidable challenge, but the increasing integration of computational biologists into experimental biology labs raises the prospects of many exciting breakthroughs for virology using NGS methodologies. The use of NGS approaches to follow sequence variation within viral populations is advancing at a breath-taking pace, and it seems inevitable that studies of viral gene expression will follow a similar trajectory. The ability to enrich for viral transcripts and perform full-length sequencing of RNA without cDNA conversion or PCR amplification will be a game changer, especially if this can be performed at the single-cell level.
The decision to use short-read or long-read sequencing approaches remains complex and is often influenced by whether the study addresses changes to the host transcriptome or focuses on the virus and whether the profiling of the polyadenylated RNA fraction is sufficient for the experimental goals. Having an adequate quantity and quality of starting material is also a major factor, as the power of long-read sequencing is nullified if RNA yields are low or there is significant RNA degradation. While long-read sequencing presents fresh challenges in terms of read alignment, experimental validation, and, ultimately, the determination of the biological significance of rarer transcript isoforms, we believe it to be advantageous over short-read sequencing when seeking to analyze complex viral transcriptomes, provided that sufficient material is available. We further anticipate that the speed of developments within the long-read sequencing field will soon yield new approaches to working with smaller amounts of input RNA and/or incorporating steps that enable the enrichment of viral transcripts, although these will require careful evaluation and optimization. The relatively low cost of long-read sequencing approaches makes the integration of both short- and long-read sequencing methodologies an affordable option, maximizing the benefits of both, especially while error correction remains key to the analysis of long-read sequencing data.
Just as NGS technologies are revolutionizing other fields, including molecular epidemiology, pathogen surveillance, and cancer biology, it is incumbent on the wider virology research community to embrace the most recent viral genome annotation/reannotation projects (58–61) and, most importantly, to incorporate the findings of new transcriptional profiling work into ongoing studies.
ACKNOWLEDGMENTS
We thank Cristina Venturini, Werner J. D. Ouwendijk, and the two anonymous referees for providing valuable feedback on the manuscript.
This work was supported in part by grants from the NIH (AI073898, GM05692, and AI130618).
FOOTNOTES
- Received 31 August 2018.
- Accepted 4 October 2018.
- Accepted manuscript posted online 10 October 2018.
- Copyright © 2018 American Society for Microbiology.