Low-Bias RNA Sequencing of the HIV-2 Genome from Blood Plasma

An accurate picture of viral genetic diversity is critical for the development of a globally effective HIV vaccine. However, sequencing strategies are often complicated by target enrichment prior to sequencing, introducing biases that can distort variant frequencies, which are not easily corrected for in downstream analyses. Additionally, detailed a priori sequence knowledge is needed to inform robust primer design when employing PCR amplification, a factor that is often lacking when working with tropical diseases localized in developing countries. Previous work has demonstrated that direct RNA shotgun sequencing (RNA-Seq) can be used to circumvent these issues for hepatitis C virus (HCV) and norovirus. We applied RNA-Seq to total RNA extracted from HIV-2 blood plasma samples, demonstrating the applicability of this technique to HIV-2 and allowing us to generate a dynamic picture of genetic diversity over the whole genome of HIV-2 in the context of low-bias sequencing.

KEYWORDS HIV-2, RNA sequencing, next-generation sequencing, vpx, whole genome H uman immunodeficiency virus types 1 and 2 (HIV-1 and HIV-2), the two causative agents of AIDS, are human pathogens of high importance (1). Following the introduction of HIV-1 and HIV-2 into human populations through zoonotic transmission of simian immunodeficiency viruses (SIVs) infecting several species of apes and nonhuman primates, HIV-1 and HIV-2 are estimated to have infected more than 75 million people worldwide, resulting in over 40 million deaths (2).
While HIV-1 and HIV-2 share some common features, a major difference between the two viruses is the typical viral load (VL) associated with chronic infection. In patients infected with HIV-2, viral load is strongly correlated with disease progression, and a large proportion (ϳ37% in the Caió cohort) maintained undetectable viral loads and high CD4 counts in the absence of treatment during follow-up (sometimes for more than two decades) (3). In addition, based on previous data from this cohort, subjects with VLs of Ͼ10,000 were regarded as having a high likelihood of subsequent progression (4). Indeed, patients with a viral load of more than 10,000 copies/ml have been suggested to be HIV-2 progressors with a reduced survival probability that is similar to that seen in HIV-1-infected individuals in the absence of treatment (5). Furthermore, lack of HIV-2 control is associated with lower viral loads than for HIV-1 in patients matched by disease stage (4,(6)(7)(8).
HIV-1 disease progression has also been correlated with viral coreceptor use or molecular properties like glycosylation patterns, charge, and length of the envelope gene (9)(10)(11)(12). Although cytopathic CXCR4-using virions have been isolated from HIV-2infected individuals in late-stage disease (13,14), less is known about correlations between molecular properties and disease stage in HIV-2 infection, particularly outside the envelope gene (15,16). One of the main barriers to a globally protective HIV vaccine is the ability of HIV to evolve rapidly, introducing mutations that abrogate the binding of neutralizing antibodies, rendering vaccine responses ineffective (17). Therefore, a major focus of HIV research has been to understand the factors affecting viral evolution and to identify viral epitopes of high conservation as potential vaccine targets (18).
Due to the relatively low copy number of the small-sized HIV single-stranded RNA (ssRNA) genome (ϳ10,000 bases), target enrichment is normally required prior to sequencing in order to generate sufficient DNA for downstream sequencing applications (19). The most common method of target enrichment is PCR amplification (20). This method has two major drawbacks. The first is the requirement for detailed a priori sequence knowledge to inform robust primer design that ensures that the majority of variants in the viral quasispecies are captured (21). Different amplification strategies have shown sensitivities down to 3,000 copies/ml, demonstrating the difficulty of generating robust and high-depth sequence data from patients without detectable plasma viremia (22,23). However, the sequence database for HIV-2 is significantly smaller than for HIV-1, and a robust and sensitive pan-HIV-2 primer set has yet to be defined and thoroughly evaluated. Mutations in primer binding sites can also reduce binding efficiency and therefore alter the proportion of specific variants in the final pool of amplicons or, in extreme cases, abrogate primer binding completely, resulting in the loss of that variant in the final analysis (24). The second drawback is that PCR is stochastically biased by amplicons from previous cycles acting as templates in the subsequent amplification cycles with the potential to further distort the picture of the viral diversity (25).
Several methods have been proposed to circumvent these problems and reduce the biases introduced into sequencing data through target enrichment. For example, primer identification (ID) allows identification of reads derived from the same viral template through incorporation of a unique 8-mer tag during the reverse transcription of viral RNA (26). Downstream reads can be pooled according to template, and multiple reads from the same template can be used for error correction. A study using primer ID observed biased diversity estimates between 2-and 100-fold compared to a library generated without any PCR bias correction, highlighting the importance of considering this factor when sequencing a highly diverse population, such as HIV (26). However, primer ID still relies on sufficient a priori sequence knowledge to allow robust primer design, and the incorporation of the barcode into the 3= end of the cDNA molecule means that it is not applicable to library preparation techniques involving random fragmentation of the target, such as those employed when using Illumina platforms.
Shotgun RNA sequencing (RNA-Seq) has been demonstrated as a powerful tool for the study of RNA viruses (27). Library preparation is performed using random-hexamer priming of the total RNA in a sample, negating the need for sequence-specific target enrichment (28). This is particularly desirable for HIV-2, for which the sequence data available are significantly limited compared with those for HIV-1. Few studies have applied RNA-Seq to human RNA viruses. For example, Ninomiya et al. applied RNA-Seq to plasma samples taken from two chronically hepatitis C virus (HCV)-infected patients and demonstrated nearly full-length genome sequences with a mean depth of coverage between 50ϫ to 70ϫ for the two patients (29). In another study, Batty et al. further expanded this method, presenting a high-throughput method for norovirus sequencing allowing 77 fecal samples to be sequenced, with a mean depth of coverage of 100ϫ and a success rate of more than 99% (30). The authors compared this with a PCR amplification strategy and found that the success rate for whole-genome amplification using PCR was 29%. This represents a significant decrease in the performance compared to that of RNA-Seq. RNA-Seq has also been used for the discovery of two novel SIVs, demonstrating the power of this method of sequencing without prior sequence information in viral discovery (31).
In the present study, we applied RNA-Seq library preparation methods to both patient plasma samples taken from a rural West African community cohort and cultured lab-adapted HIV-2 reference strains. We show that RNA-Seq followed by de novo assembly is a feasible and powerful approach when applied to HIV-2 samples with viral loads of at least 5,280 copies/ml. In addition, we demonstrate that RNA-Seq represents a novel, low-bias method of HIV-2 sequencing. Finally, we computed estimates of nucleotide diversity for each gene of HIV-2 on both the intra-and interhost levels. These analyses indicated consistently low estimates of diversity in the accessory gene vpx within hosts, highlighting the importance of this HIV-2-specific gene in successful HIV-2 infection.

RESULTS
Patient and sample characteristics. Samples from a panel of six members of the Caió HIV-2 community cohort (TD003, TD006, TD013, TD024, TD031, and TD062), whose plasma viral loads represented the broad spectrum seen in natural HIV-2 infection, as well as cultures of two lab-adapted HIV-2 strains (HIV-2 ROD and HIV-2 CBL20) were subjected to standard RNA-Seq library preparation ( Table 1).
Assessment of RNA-Seq using HIV-2 ROD. First, we assessed the performance of RNA-Seq using the well-characterized reference strain HIV-2 ROD. Following initial quality control and removal of low-quality reads and adaptor contamination, reads were assessed for the presence of biased random-hexamer priming. The randomhexamer analysis indicated that a random-hexamer bias affected the first 13 bp of the read. The remaining high-quality reads were assembled to the HIV-2 ROD reference genome sequence (accession number BD413542). The mean depth of coverage over the whole genome was around 2,000ϫ for all alignment tools, with GSNAP having the highest mean depth ( Table 2). All four alignment tools produced a slightly positive GC bias, and more GC-rich regions tended to have higher coverage. The slopes were very similar (0.79 to 0.91), implying that the assembly algorithm used does not affect the GC bias. In order to assess how divergent the HIV-2 ROD that was propagated for the present study was from the published reference sequence, polymorphisms that were fixed at a frequency of Ͼ95% in the sample population were annotated as single nucleotide polymorphisms (SNPs) using VarScan (Fig. 1). The BWA-SW build was used for this analysis, as it agreed with the majority consensus at each site of conflict. All genes except vif had SNPs (in total 70 SNPs), and the majority of SNPs were seen in gag, pol, and nef. However, when corrected for gene length, nef showed the greatest contribution to divergence from the reference genome. De novo genome assembly and factors influencing RNA-Seq success rate. After showing that our RNA-Seq approach could be used for whole-genome sequencing of a high-copy number and lab-adapted HIV-2 strain, we assessed the feasibility of using RNA-Seq to generate whole-genome sequences directly from primary patient blood plasma samples (Table 1). Clinical blood plasma samples often contain significant amounts of human RNA, making it challenging to perform de novo assembly of minority species (such as HIV-2). VICUNA is designed to target populations with high mutation rates and map minority variants into a single consensus sequence and is therefore particularly suitable for HIV-2, considering the few publicly available HIV-2 wholegenome sequences. In addition, since HIV-2 blood plasma samples usually have significantly lower viral copy numbers than propagated virus isolates, we included a lab-adapted HIV-2 strain derived from a Gambian subject (CBL20), for which the whole-genome sequence is unknown, as a high-viremia control (Table 1).
In total, whole-genome assembly was successful for three of the six patient samples (TD024, TD031, and TD062) and the control (CBL20). Successful patient samples showed complete capture of the protein-coding region of HIV-2 and merged contigs ranged from 9,397 to 9776 bp in length. A merged contig spanning the complete coding region was also assembled for CBL20, demonstrating the applicability of RNA-Seq to both in vitro and ex vivo samples. Although we are unable to provide a formal cutoff value, these results suggest that a viral load of at least 5,000 copies/ml is needed for successful sequencing, with an expectation of at least 0.001% HIV-derived RNA (Table  3). When these limits are considered, the success rate was 75%.
In order to assess how well de novo assembly using VICUNA had captured the HIV-2 genome, consensus sequences were aligned to the commonly used HIV-2 group A reference sequence UC2 (accession number U38293) and annotated according to homology (Table S1). In patient sample TD031, the first 177 bp of the gag leader sequence was missing, whereas the long terminal repeat (LTR) region lacked coverage for patient samples TD024 and TD062 and the reference strain CBL20. Phylogenetic analysis of de novo genome sequences. A BLAST analysis of the newly generated sequences indicated highest similarity with HIV-2 group A sequences. Bayesian phylogenetic analysis with the 20 publicly available HIV-2 group A near full-genome sequences confirmed this ( Fig. 2; Table S1). This analysis also showed that the newly generated sequences were clearly distinguishable from existing reference sequences. As expected, the HIV-2 ROD sequence generated in the present study and the published reference sequence were closely related and clustered together, with a posterior probability of 1.
Read remapping to the patient-specific consensus whole-genome sequences. In contrast to resequencing projects, in which a high depth of coverage is required for error correction, deep sequencing of pathogen populations uses high depth of coverage to gain a picture of the diversity in the population as a whole (27). Following de novo assembly of a patient-specific consensus genome sequence, we assessed the performance of four commonly used alignment tools when remapping reads to the patient-specific consensus ( Table 2). Read remapping was performed using the total reads without prior HIV-2 enrichment or digital subtraction of human sequences to allow an assessment of how these tools perform in the context of a high level of contamination. This is likely to be a factor of all pathogen sequencing strategies employing RNA-Seq. Mean depth and range of coverage were compared for each aligner (Fig. 3). These results show consistent performance of the four aligners, with mean depths of coverage ranging from 28ϫ to 67ϫ for the three patient samples. This range is in line with previous RNA-Seq studies, showing that RNA-Seq is a feasible tool for generating HIV-2 whole-genome sequences. Additionally, the high similarity indicates that read mapping is robust and repeatable irrespective of which alignment a Absolute viral input estimated from viral load. b Total RNA input used for library preparation. c Estimated using a viral genome length of 10 kb and absolute viral input.

FIG 2
Bayesian phylogeny of HIV-2 genome sequences generated in the present study. Eighteen whole-genome HIV-2 group A sequences were included as a reference set (Table S1). Reference sequences are colored according to country of origin, and sequences generated in the present study are shown in red. Bayesian posterior probabilities are included on the corresponding nodes, and the scale bar represents the number of nucleotide substitutions per site.
algorithm is employed following de novo assembly of patient-specific consensus sequences. Error rates were consistent over cycles, and there was no evidence of a drop in accuracy over the length of the read (Fig. 4). Assessment of the random-hexamer bias. A commonly recognized bias that is specific to RNA-Seq protocols is the random-hexamer bias (32,33). Hypothetical differential binding affinities between different random hexamers result in biased nucleotide composition at the 3= ends of the reads, normally spanning 7 to 13 bp. In line with previous studies and our initial assessment of RNA-Seq using HIV-2 ROD, the random-hexamer analysis indicated that a random hexamer bias affected the first 13 bp of the read (Fig. S1). The pattern of the bias was similar in all three patient samples and

HIV-2 RNA Sequencing
Journal of Virology the control (CBL20), suggesting that there may be preferential binding to the same motifs in all samples. This biased read composition can be attributed to randomhexamer bias rather than low-quality sequencing at the end of the reads, as the median Q-score was constant over the length of the read. The effect of the bias did not extend past the first 13 bp of each read, and the nucleotide composition stabilized after this point. A correction was not applied to account for the biased nucleotide composition of the first 13 bp, as removal of these positions does not remove the effects of this bias seen in downstream analyses. Quantification of the GC bias and depth of coverage as a function of genomic context. Depth of coverage in samples sequenced using Illumina short-read chemistry can be affected by the local GC content of the genome (34). We assessed the effect of local GC content on depth of coverage using a custom script which took a sliding window of 50 bp, with a step size of 20 bp, and calculated percent GC and mean depth of coverage in each window. The extent of the GC bias was quantified using the slope of the linear regression line, and the bias was assessed for each aligner individually (Table 4). To further compare the different aligners, the mean depth of coverage was normalized in each window using the genome-wide mean depth of coverage (Fig. 5). All assemblies showed a slight, positive GC bias, suggesting that GC-rich regions had a depth of coverage that was higher than the mean. For patient samples TD024 and TD031, the magnitudes of the slope were similar for all four aligners, suggesting a constant effect when different assembly algorithms were employed. In contrast, sample TD062 showed more fluctuation between aligners. However, the magnitude of the bias was lowest for this patient, suggesting that the overall effect of the GC bias would be reduced, in spite of the fluctuations. Hence, a positive GC bias in HIV-2 samples sequenced using RNA-Seq may confer variability in depth of coverage over the genome. However, the magnitude of the bias was in line with previous studies and did not show a loss of coverage of any genomic regions due to GC bias (34).
In order to assess whether genomic context could affect depth of coverage, the HIV-2 genome was partitioned according to gene and mean depth of coverage was compared for each gene individually. The effect of genomic context on depth of   coverage was visualized by plotting mean depth of coverage as a function of GC content for each gene (Fig. 6). All aligners showed a similar pattern of coverage and no consistent loss of coverage in any genomic region.

No general trends in molecular properties between the analyzed HIV-2 strains or correlations with clinical stage.
To characterize the molecular properties of the newly generated sequences and to put them in a broader perspective, we performed an in-depth analysis of these and the 20 selected and publicly available HIV-2 group A sequences. Associations between molecular and biological properties were assessed by available clinical and epidemiological data (Table S1; Fig. S2). All analyses were performed per HIV-2 gene. In the data set there were two occasions of duplicate origin, i.e., two sequences that had been generated from the same original patient sample (Table  S1; Fig. S2, RODR and A.SN.ROD plus A.JP.NMC786_41 and A.JP.NMC786_41). These were counted only once when assessing associations between molecular and biological properties between sequences collected during the asymptomatic versus AIDS stage of disease. No significant differences in sequence length, net charge, total charge, or number of potential N-linked glycosylation sites (PNGS), in any of the nine HIV-2 genes, were found between sequences from asymptomatic patients (n ϭ 6) and AIDS stage patients (n ϭ 12) (Table S1; Fig. S2). Prediction of coreceptor tropism based on the env gp120 V3 region indicated that 50% (3 of 6) and 58% (7 of 12) of the participants had CXCR4-using viruses in the asymptomatic and AIDS stages, respectively (P ϭ 1.00, two-tailed Fisher's exact test) (Table 5). Furthermore, we found no diagnostic motifs or amino acids between asymptomatic and AIDS stage patients in any of the nine HIV-2 genes (Table S1; Fig. S2).

Genome-wide estimation of genetic diversity in HIV-2 in the context of lowbias sequencing.
To determine how the diversity varies over the HIV-2 genome, we estimated nucleotide pairwise diversity from assembled reads (Bowtie2 assembly) using a custom script. Raw estimates of diversity of the whole genome were 0.0010 substitution/site for TD024, 0.0007 substitution/site for TD031, and 0.0014 substitution/site for TD062. In comparison, the raw estimates for the env gene were 0.0013 substitution/ site for TD024, 0.0008 substitution/site for TD031, and 0.0020 substitution/site for TD062. To compare the relative genetic diversity between the patients, we normalized the raw estimates using the genome average (Fig. 7). Overall, our analysis showed similar results between patients, with the highest level of within-host diversity seen in env for all three patients, whereas the lowest diversity was seen in the vpx and rev genes. Interestingly, the diversity in pol seemed to be higher than for the genes gag, vpx, tat, rev, and vif for all three patients.
To compare the above-described results of intrahost viral diversity in different HIV-2 genes with viral diversity in different HIV-2 genes between hosts, we performed a phylogenetic bootstrap analysis of our newly generated whole-genome sequences and the reference sequences. This analysis showed that, similar to the intrahost viral diversity above, the env gene was the most diverse gene, followed by the nef gene. However, in contrast to the intrahost analysis, this analysis indicated that pol was the second least diverse gene between hosts (only vif was less diverse [Fig. 8]).

DISCUSSION
Deep sequencing of HIV offers unparalleled opportunities to gain a high-resolution picture of the nature and diversity of the viral quasispecies in a single patient. Our study presents a novel pan-HIV-2 whole-genome amplification strategy using RNA-Seq, allowing the entire protein-coding region of HIV-2 to be sequenced without the need for detailed a priori sequence knowledge. We show a broad applicability of this method, presenting data from both lab-adapted isolates and patient plasma samples. To our knowledge, only one previous study has used a next-generation sequencing approach to determine the full genome of HIV-2 (35). However, we used HIV-2 isolates propagated in cell culture prior to library preparation and aligned the generated sequence reads to a common reference strain (HIV-2 BEN). We analyzed patient samples and were able to successfully sequence samples of viral loads down to 5,280 copies/ml and an expectation of at least 0.001% HIV-2 RNA in the sample. When these conditions were fulfilled, the success rate was 75%, which is lower than previously reported by Batty et al. when applying RNA-Seq to norovirus. However, the lower HIV-2 plasma viral loads of the patient samples used in the present study readily explain this reduced success rate. A putative cutoff of approximately 5,000 copies/ml restricts this method to viremic HIV-2 patients, and it is possible that an alternative approach would be needed for  TD024  34  5  5  1  N  N  N  N  R5  TD031  34  5  5  1  N  N  N  N  R5  TD062  34  5  5  1  N  N  N  N  R5  CBL20  34  6  6 a Coreceptor tropism was assessed as described in reference 59. b Net charge of sequences was determined based on each lysine and arginine contributing ϩ1 and each aspartic acid and glutamic acid contributing Ϫ1. c Total counts of amino acids were also assessed as described previously (58). d Number of potential N-linked glycosylation sites (PNGS) as defined in N-GLYCOSITE (37). e N, no; Y, yes. f R5, CCR5-using virus; R5X4, CCR5-and CXCR4-using virus.

FIG 7
Nucleotide diversity plot. Nucleotide site diversity estimates per gene, normalized to the wholegenome estimate. Diversity was estimated for samples TD024, TD031, and TD062. Calculation of the diversity relative to the whole-genome estimate was performed to allow a comparison between patients.
samples with lower viral loads. However, we anticipate that RNA-Seq could also be successfully applied to samples taken from untreated HIV-1 patients, where the typical viral load is 10 to 1,000 times higher than for HIV-2. While RNA-Seq allows whole-genome sequencing of HIV-2 without the need for detailed prior sequence knowledge, the lack of sequence-specific target amplification also leads to a reduction in the use of PCR amplification and its resulting biases, thereby generating sequence data that are more representative of the true population frequencies. In this study, we aimed to quantify the other biases known to be associated with RNA-Seq. We found evidence of a moderate positive GC bias which varied between samples but was consistent when different aligners were used. We also found evidence of a biased nucleotide composition in the first 13 bases of the reads, suggesting the presence of non-random random hexamer priming. Although these biases could be responsible for the fluctuations in coverage over the genome, we observed no correlation between genomic location and depth of coverage. This suggests that these fluctuations were randomly distributed and not due to the various diversities seen in different functional genomic sites.
While patient consensus sequences contained all nine genes of HIV-2 in intact reading frames, there was some variability in the assembly of the 3= and 5= LTR and the gag leader sequence. In patient sample TD031, the loss of the first 177 bases of the gag leader sequence could be attributed to the failure of the RNA-Seq library preparation method to capture this region. The initial fragmentation step in library preparation can lead to the loss of distal regions of the RNA molecule, and this is the most probable cause of the lack of coverage in this genomic region. For patient samples TD024 and TD062 and the reference strain CBL20, the lack of coverage was probably due to the nature of the LTRs in HIV-2. The 5= and 3= LTR regions only exist as true 990-base repeats in the proviral form of the virus, whereas in the RNA genome, the 5= LTR comprises the R and U5 regions and the 3= LTR is composed of the R and U3 regions (36). The sequence alignment used during assembly contained HIV-2 sequences from both cDNA and RNA HIV-2 genomes, so assembly was conducted using "complete" LTRs, both containing U5, R, and U3 (37). Ambiguous read mapping is normally resolved by using the location of the read mate to provide information on the most likely coordinates. In the present study, the insert size (250 to 350 bases) and the nature of the LTRs made mapping reads for the R region problematic, as the read mate will also fall in the LTR. Therefore, it was not possible to resolve the correct orientation of the reads, resulting in the loss of coverage of one LTR.
The ability to sequence the whole HIV-2 genome in a single experiment allowed us to compare pairwise nucleotide diversities between the different HIV-2 genes. A study comparing outcomes between HIV-1-and HIV-2-coinfected and HIV-1-monoinfected subjects followed from early infection showed that the extent of HIV-1 genetic diversity strongly correlated with time to AIDS (38). Much less is known about genetic diversity in HIV-2 infection. In our small study of three subjects, all of whom presented with raised viral load, the patterns of within-host diversity were similar in all three. The highest diversity was seen in env, which is in keeping with previous observations in HIV-1 infection. Diversity in partial fragments of HIV-2 env has previously been estimated through different approaches, and although on the lower side, our estimated intrahost env diversities were in the same range as those from previous studies, which used molecular cloning for sequence generation (15,39,40). Similarly, a high level of nef diversity was seen in all three patients. In HIV-1 infection, pol is thought to be highly conserved for functional reasons and therefore typically shows a relatively lower diversity than, for example, the env gene (41). In contrast, we observed a high level of within-host HIV-2 pol diversity in all three subjects studied in this investigation. Interestingly, a recent study has shown a high level of within-host diversity in pol following vertical HIV-1 transmission (42), although vertical transmission of HIV-2 is uncommon and is unlikely to be implicated in our study subjects. There are some potential caveats of our intrahost diversity analysis. (i) HIV-2 diversity has been reported to increase over the course of infection (15,39). It is possible that parameters such as the duration of infection or the mode of transmission influenced the diversity levels in our study, but these parameters are not known for the three study subjects. (ii) For some single nucleotides over the genome, the coverage was less than 20 sequence reads, and from a sample perspective, the depth of coverage was positively correlated with the viral copy number. On the one hand, low coverage may underestimate the true genetic diversity. On the other hand, some regions of the genome are evolutionarily conserved, and only a limited number of virus variants can theoretically coexist in a sample taken from a subject with a low viral load.
vpx is an HIV-2/SIVmm specific accessory gene that is entirely absent from the HIV-1/SIVcpz lineage. The main role of vpx is antagonism of the host restriction factor SAMHD1, which blocks reverse transcription of viral RNA in slowly dividing cells, such as macrophages and resting CD4 ϩ T cells (43). Our observation of a consistently low level of vpx diversity may indicate a high level of conservation in vpx, suggesting that vpx has a critical role in HIV-2 pathogenesis. The implications of vpx diversity in HIV-2 infection are not yet clearly defined, but a recent study by Yu et al. identified an SNP in a vpx allele derived from a viremic patient that totally abrogated the ability of vpx to promote SAMHD1 degradation in vitro (44).
In conclusion, we show that RNA-Seq library preparation methods can be applied to HIV-2 blood plasma samples. Resulting de novo genome assemblies captured the entire coding region of HIV-2 in intact open reading frames and read remapping allowed us to demonstrate the importance of a two-step analysis pipeline. In the context of a highly diverse retrovirus, such as HIV-2, the selection or generation of an appropriate reference sequence is a critical first step, allowing robust and repeatable downstream read mapping. We also demonstrated a low level of GC and random-hexamer bias and, in the absence of sequence-specific target amplification, showed that RNA-Seq offers a method of whole-genome HIV-2 sequencing in a low bias context. However, some challenges in RNA-Seq remain. For example, although the sequencing costs have fallen dramatically in recent years, RNA-Seq is still expensive and costs continue to be a barrier to an even more widespread adoption. In the present study, we multiplexed six patient samples using the Illumina HiSeq in order to reach a mean depth of coverage of up to 67ϫ. Although this coverage is more than sufficient for consensus sequence calling, it may be too low if the primary goal is to determine minority variants (at least in samples with high viral loads). Hence, the importance of developing novel and low-bias HIV sequencing protocols cannot be overstated, as the ability to gain a complete and accurate picture of HIV genetic diversity is critical to the development of globally effective and preventative HIV vaccines.

MATERIALS AND METHODS
Patient sample collection. All patient samples used in the present study were collected from members of the Caió community cohort who had provided written and informed consent. Samples were collected prior to the start of the present study. Plasma was separated from whole blood through centrifugation (5,000 ϫ g, 5 min, and 4°C) and filtration (0.45-m filter; Millipore, Billerica, MA). Plasma samples were stored at Ϫ80°C before being transported to Oxford, United Kingdom, in a liquid nitrogen dry shipper. None of the samples used had any record of previous freeze-thaw cycles.
In vitro culture of lab-adapted HIV-2 reference strains. The lab-adapted HIV-2 strains HIV-2 ROD and HIV-2 CBL20 were propagated in vitro in the lymphocyte cell line H9, a single cell clone derived from a HUT 78 cell line. Infection of 5 ϫ 10 6 cells was carried out with 200 l of 9 ϫ 10 3 50% tissue culture infective doses (TCID 50 )/ml of viral stock. Cells were removed through centrifugation at 250 ϫ g for 10 min, and supernatant was collected on days 3, 5, 7, 9, 11, 13, and 15. HIV-2 concentrations were assayed using a colorimetric reverse transcriptase assay (Roche). For each isolate, the supernatant sample with the highest reverse transcriptase concentration was selected for RNA-Seq.
RNA extraction, RNA quantification, and DNase treatment. Total nucleic acid was extracted directly from 500 l of patient plasma or purified supernatant using the QIAamp UltraSens viral kit (Qiagen). Extraction was performed according to the manufacturer's protocol, with the substitution of carrier RNA with linear acrylamide (Ambion) as the nucleic acid coprecipitant. Final elution was performed in 12 l of H 2 O. DNA was removed from the samples through treatment with DNase I (Turbo DNase; Ambion) according to the manufacturer's protocol. RNA concentration was estimated using the QuBit RNA assay (Invitrogen).
Library preparation and sequencing. Sequencing libraries were prepared from 5 l of the eluted RNA using the NEBNext Ultra RNA library prep kit for Illumina (New England BioLabs) according to the manufacturer's protocol. Sequencing libraries were multiplexed and sequenced using the Illumina HiSeq or MiSeq platform (Illumina). Patient samples were multiplexed at 6/lane (HiSeq), generating 2 ϫ 100 nucleotide paired-end reads, and lab-adapted strains were multiplexed at 2/lane (MiSeq), generating 2 ϫ 150 nucleotide paired-end reads.
De novo genome assembly and read remapping. Sequence data were analyzed using a custom pipeline. Reads were trimmed using Sickle, stipulating a median Q-score of Ͼ30 and a read length of Ͼ40 bp (45). De novo genome assembly was performed using VICUNA (46), with the addition of the optional contamination removal step. During contamination removal, HIV-2-derived reads were identified through similarity to a multiple-sequence alignment containing a set of 18 publicly available HIV-2 group A sequence data (Table S1). Overlapping contiguous sequences generated by VICUNA were assembled into whole-genome sequences using the map-to-reference feature in Geneious v6.1.6 (47) and manually inspected to derive a whole-genome consensus sequence. Consensus genome sequences were manually inspected to ensure that they contained intact open reading frames. Reads were remapped to the consensus genome sequence using Bowtie2 (48), BWA-SW (49), GSNAP (50), and NovoAlign (51) for each sample. Files containing assembled reads were manipulated using the SAMtools package (52), and downstream statistical analyses and data visualizations were performed using R (53) and the Interactive Genome Viewer (54). Error rates were estimated using the ErrorRatePerCycle feature of GATK (55).
Quantification of biases. Random-hexamer bias was assessed through visualization of the base composition of reads using FASTQC (56). GC bias was quantified using a custom Python script that scanned the genome using a 50-bp sliding window with a step size of 20 bp. Mean GC content and mean depth of coverage were computed for each window, and GC bias was assessed by fitting a linear regression in R (57).
Analysis of molecular properties. Analyses of molecular properties were performed using an in-house Perl script with potential N-linked glycosylation sites (PNGS) as defined in N-GLYCOSITE (37). Net charge of sequences was determined based on each lysine and arginine contributing ϩ1 and each aspartic acid and glutamic acid contributing Ϫ1. Total counts of amino acids were assessed as described previously (58). Coreceptor tropism was predicted using four major determinants of dual/CXCR4 coreceptor use (L18Z, V19K/R, V3 net charge of Ͼϩ6, and insertions at position 24) (59). CXCR4 use was considered when at least one of the criteria was fulfilled. Sample donors were classified as having been sampled during either the asymptomatic or AIDS stage (as defined by clinical assessment at the sample time point).