Local Evolutionary Patterns of Human Respiratory Syncytial Virus Derived from Whole-Genome Sequencing

ABSTRACT Human respiratory syncytial virus (RSV) is associated with severe childhood respiratory infections. A clear description of local RSV molecular epidemiology, evolution, and transmission requires detailed sequence data and can inform new strategies for virus control and vaccine development. We have generated 27 complete or nearly complete genomes of RSV from hospitalized children attending a rural coastal district hospital in Kilifi, Kenya, over a 10-year period using a novel full-genome deep-sequencing process. Phylogenetic analysis of the new genomes demonstrated the existence and cocirculation of multiple genotypes in both RSV A and B groups in Kilifi. Comparison of local versus global strains demonstrated that most RSV A variants observed locally in Kilifi were also seen in other parts of the world, while the Kilifi RSV B genomes encoded a high degree of variation that was not observed in other parts of the world. The nucleotide substitution rates for the individual open reading frames (ORFs) were highest in the regions encoding the attachment (G) glycoprotein and the NS2 protein. The analysis of RSV full genomes, compared to subgenomic regions, provided more precise estimates of the RSV sequence changes and revealed important patterns of RSV genomic variation and global movement. The novel sequencing method and the new RSV genomic sequences reported here expand our knowledge base for large-scale RSV epidemiological and transmission studies. IMPORTANCE The new RSV genomic sequences and the novel sequencing method reported here provide important data for understanding RSV transmission and vaccine development. Given the complex interplay between RSV A and RSV B infections, the existence of local RSV B evolution is an important factor in vaccine deployment.

H uman respiratory syncytial virus (RSV) is a leading viral cause of severe respiratory infection during infancy and early childhood and among immunocompromised populations (1,2). Globally, the virus is estimated to be responsible for 30 million episodes of acute lower respiratory tract infections (RTIs) and more than 50,000 deaths annually in children under 5 years of age (3). RSV infections throughout the world consistently occur as annual or biennial epidemics, and persons of all ages can be infected with diverse clinical outcomes ranging from mild upper RTIs to severe pneumonia or bronchiolitis (2,4). A vaccine against RSV is not yet available (5). Careful analyses of RSV molecular epidemiology, evolution, and transmission are essential for defining the circulating viruses, for characterizing antigenic variation, and for tracking transmission patterns. The outcome of these studies can support new strategies for RSV control and vaccine use and development.
It has long been known that children suffer repeated RSV infections throughout life (6,7). The ability of the virus to continue to infect previously exposed individuals is thought to be linked to an ability to bypass preexisting immune responses (8). Sequence variation in attachment (G) protein in consecutive years (9) is thought to be part of this mechanism; also, the global existence of two groups, A and B, and their alternating infection incidences may play a role (10,11). The transmissibility of RSV group A (RSVA) is estimated to be slightly higher than that of RSV group B (RSVB) (12), and RSVA infections are more frequent than RSVB infections (12). An additional important feature of RSV infection is the apparently rapid global dispersion of new RSV variants (13). Indeed, genetically similar viruses cluster more by time than by location, suggesting rapid global movement of new variants (14). RSV molecular pathology and epidemiology have been reviewed in detail elsewhere (2,4,15,16).
Historically, RSV molecular epidemiology has focused on the 900-bp region encoding the G protein (15,17). The G protein together with the fusion (F) protein are important targets of human protective antibody responses (8,15), with changes in this region thought to be driven by pressure to avoid host immune responses. Although studies of the sequence variability of RSV have concentrated on G gene variability, given the rapid infection pace but relatively low evolutionary rate of RSV, transmission studies over short periods require the stronger evolutionary signal provided by the full virus genome sequence (15, There is also a need to understand the nature of variation of immune targets other than the G protein. Advances in primer design, sequencing technology, and sequence assembly algorithms now allow full-genome sequencing for a number of viruses, including RSV (18)(19)(20)(21)(22)(23), norovirus (24), and Middle East respiratory syndrome (MERS) coronavirus (25,26).
The current work describes RSV genome evolution across a set of clinical samples collected from children who presented with severe RSV disease in a rural coastal Kenyan hospital using a novel RSV whole-genome sequencing (WGS) approach optimized for small amounts of clinical diagnostic samples. The sequence data provide an update of the genome-wide diversity of circulating RSV strains in this part of Kenya, including both RSVA and RSVB and the recently reemerged group B genotype GB3 (27). The novel genomes support previous conclusions on patterns of local RSV variation relative to global RSV diversity and reveal a significant difference in local evolution of RSVA versus RSVB.

MATERIALS AND METHODS
Primer design. All RSV sequences available (August 2012) with lengths of Ͼ14,000 nt were collected and sorted by group, yielding 138 RSVA and 38 RSVB genomes. The sequences for each group were pooled and sliced into 33-nt strings with a 1-nt step size. The 33-mers were filtered to remove sequences with ambiguous nucleotides, and the frequency of each sequence within the set was determined. The 33-nt sequences were then trimmed to a calculated melting temperature (T m ) of 58°C, discarding sequences mapping to human rRNA, with GC contents of Ͻ30% or Ͼ65%, or with a single nucleotide frequency of Ͼ60%. The RSV genome was divided into six 3-kb segments overlapping by 300 nt. All sequences were mapped to an RSVA or RSVB reference strain, and the two most frequent primers mapping within 300 nt of the end of each amplicon were selected. The reverse complement of the downstream sequences was prepared. To ensure amplification of the far ends of the genomes, two additional primers were included from the 5=-and 3=-terminal genomic regions. A summary of the primer sequences and their predicted target sequences across all known RSV genomes is presented in Table 1.
Clinical samples. Viral nucleic acid for sequencing was extracted from RSV-positive clinical specimens (nasopharyngeal swabs [NPS] or washes) collected from children under 5 years old admitted to the Kilifi District Hospital (KDH) with severe or very severe pneumonia between 2002 and 2012. RSV infection was diagnosed with an indirect immunofluorescence antibody technique (IFAT; Light Diagnostics). Details of the study that provided the samples sequenced in this study have been previously provided (28). Informed consent was obtained from a parent or guardian on behalf of each child before specimen collection, and the KEMRI Ethics Review Committee approved all protocols. Additional details on the samples are provided in Table 2. RNA extraction, RT, and PCR. Viral RNA was extracted with the QIAmp extraction kit (Qiagen, United Kingdom) from a starting NPS specimen volume of 140 l and final elution volume of 60 l. Reverse transcription (RT) of RNA molecules was performed with the forward primers for each of the six amplicons. A separate RT reaction was performed for each amplicon. Typically, the 20-l reaction mixture contained 2 l of sample RNA. A 5-l aliquot of the resulting cDNA was used in each 25-l PCR mixture. The PCR mixture was incubated at 98°C for 30 s, followed by 40 cycles of 98°C for 10 s, 53°C for 30 s, and 72°C for 3.0 min and a final extension of 72°C for 10.0 min. Following PCR, aliquots of the products were run on a 0.6% agarose gel to monitor amplification success, and the products from the 6 reactions for each sample were pooled for Illumina library preparation.
Deep sequencing. Sequencing of the pooled amplicons was performed with Illumina MiSeq. Samples were multiplexed at 15 to 20 per MiSeq run and processed as paired-end reads (2 ϫ 149 nt), generating approximately 1.5 million reads per sample. Raw sequence data were pro-cessed with QUASR (29) to remove low-quality (Ͻ median Phred 35) and adapter-containing reads, and de novo assembly with SPAdes (30) was performed. RSV contigs were identified by BLASTN analysis, and lowcoverage contigs were excluded. Where necessary, partial but overlapping genome contigs were combined using Sequencher (v5.2.4). All final viral genomes were examined for appropriate assembly based on length and the presence of the expected intact RSV open reading frames.
Protein changes. After sorting by virus group (RSV group A or B), the genomic region under investigation was translated, the protein sequence was aligned using MAFFT (31), and protein differences from the consensus sequence of the group were visualized and quantitated using Python scripts.
Reference data set. A comprehensive RSV genome data set was generated from the GenBank database using as a starting set all reported RSV genomes. The search was conducted on 28 September 2014 using the search term "txid11250 [Organism]) AND 13500[SLEN]: 17000 [SLEN]." Genomes with multiple ambiguous bases, lacking country of detection or date of collection (year), or from patent depositions were excluded. The newly sequenced Kilifi RSV genomes for each group were combined with those from GenBank in the subsequent analysis. Thinned representative reference sets were prepared by using the usearch algorithm (32).
Phylogenetic analyses. Phylogenetic trees of the genome sequences and selected genomic regions were constructed using the Bayesian methods in MrBayes program v3.2.1 (http://mrbayes.sourceforge.net/index .php) under the general time reversible model of evolution. RSVA and RSVB were analyzed separately using both the total data set and the thinned data sets. The viruses within the groups were assigned to genotypes based on the clustering pattern of the G ORF portion sequences with reference sequences representative of the previously described RSV genotypes: for RSVA, strains representing GA1-7, SAA1, and ON1, and for RSVB, strains representing GA1-4, SAB1-SAB4, and BA (11,(33)(34)(35). Phylogenetic trees were visualized in FigTree v1.4.2.
Evolutionary analyses. Nucleotide substitution rates and estimates for time to most recent common ancestor (tMRCA) were obtained from the usearch-thinned data sets, using uclust to remove genomes closer than ID 0.99 (32). The rates and tMRCA estimates were calculated in BEAST v1.7.5 (36) both for full genomes and for the individual ORFs.
Nucleotide sequence accession numbers. The final set of RSV sequences was deposited in GenBank with the following accession numbers: KP317916 to KP317956.

RESULTS
Two sets of reverse transcription and PCR primers were selected from all available RSVA and RSVB genomic sequence data based on frequency, location, and predicted PCR function (see Table 1 for further details). The general pattern of primer sites and the locations of primer targets in RSVA and RSVB genomes are shown in Fig. 1A. Actual PCR results are shown in Fig. 1B for RSVA and RSVB samples, with PCR products of the expected size obtained for all 6 amplicons. These primers were used as part of a deepsequencing process for RSV combining the full cDNA preparation and genome amplification, deep sequencing with Illumina MiSeq, and de novo assembly (Fig. 1C) to generate 27 complete or nearly complete genomes (11 group A and 16 group B; median length, 14,990 nt; range, 14,666 to 15,232 nt). An additional number of samples yielded RSV contigs of Ͼ5,000 nt in length, and these were also retained for further analysis. A summary of the genomic sequences in this study is provided in Table 2.
RSV global phylogenetic clustering and placement of Kilifi genomes. The 27 Kilifi genomes were combined with RSVA and RSVB genomes from 16 countries from specimens collected between the years 1981 and 2013 (see Materials and Methods). The phylogenetic clustering is shown in Fig. 2A (RSVA) and B (RSVB).
RSVA forms 3 major clades: GA1 (including strains only from the United States), GA5 (with U.S. and global strains), and a clade with both GA2 and the ON1 viruses with a 72-nucleotide duplication in the G ORF (33), which included nearly all of the new Kilifi RSVA genomes (GA2_ON1). Multiple subclusters showing temporal clustering were detected within each of these clades.
Four clades were designated for the RSVB genomes, with BA containing the majority of the Kilifi sequences (Fig. 2B). Clade Comparison of genomes of viruses with identical G protein ORFs. One motivation for developing full-genome methods was to increase the sensitivity for tracking RSV across short-term transmission chains. We asked if viruses identical in their G gene regions had differences elsewhere in their genomes. All RSV genomes (both GenBank or in the new data presented here) with identical G regions were identified, and the number of changes outside the G region were determined. Of 7 sets of viruses with identical G regions, all showed at least 1 but up to 9 nucleotide differences across the full genome (Fig. 3). This increased resolu-  (Fig. 5). d Samples yielding sufficient sequence for F region analysis (Fig. 5). e The final genome data were deposited in GenBank with the indicated accession numbers. f Short-read data available at European Nucleotide Archive (http://www.ebi.ac.uk/ena).
tion will be important in future studies examining RSV household transmission patterns to identify who acquires infection from whom. Estimation of RSV tMRCA and evolutionary rates. Previous data on RSV evolution are largely derived from the G protein coding region. The full genomes generated in this study were combined with the GenBank reference data set, and these allowed an estimation of the global nucleotide substitution rates and the time to most recent common ancestor (tMRCA) for all the recently sequenced RSVA and RSVB viruses. These estimates were calcu- lated for the different ORFs and the whole-genome sequences (Fig. 4). The whole genomes provided more precise estimates of the MRCA, as observed from the interval of lower and upper 95% highest posterior density (HPD) compared to individual ORF data for the same set of viruses (Fig. 4B).
The G protein ORF showed the highest nucleotide substitution rates for both RSVA and RSVB (Fig. 4A). Elevated changes in G and M2-2 were observed previously using RSV full genomes from U.S. and European cohort data (21). Similar to the MRCA estimates, the whole-genome estimates for the evolutionary rates showed narrower confidence intervals than those from the individual ORFs. The two regions considered for vaccine targets, G and F, show a strikingly wide difference in rate, and this may be important for selecting conserved vaccine targets.
Changes in G and F coding regions, comparing local and global viruses. An important consideration for vaccine development is how representative a vaccine strain is for locally circulating viruses. The transmission patterns of a virus, the evolutionary rate of the virus, and patterns of human movement can strongly influence how quickly global strains reach a rural location. To address this important issue, the amino acid changes encoded by the RSV coding sequences observed in Kilifi were compared to the amino acid changes observed for all known RSV genomes from other parts of the world (Fig. 5; Table 3).
A large percentage of the changes observed in the RSVA G protein were also observed globally, with 88% of the changes seen in Kilifi RSVA G also observed in other parts of the world ( Table  3). The Kilifi RSVB viruses appeared to have more local evolution, with only 60% of the observed changes in G shared with global viruses. With reference to the F protein, for Kilifi RSVA viruses, 80% of the observed changes were also found globally, while the Kilifi RSVB viruses showed a higher degree of local evolution, with only 20% of the observed changes specific to Kilifi viruses seen in other locations. To determine if this local evolution of RSVB was observed at other sites, the sequence data were stratified to other locations (the United States, Argentina, and Peru), but no significant local patterns were observed. This suggests that the isolation of the Kilifi site was more pronounced than for other sites. Alternatively, this may reflect more intense sampling of RSVB within a limited area.
The RSV envelope proteins are heavily glycosylated. More than 50% of the G protein mass can be carbohydrate (37), and the potential O-linked glycosylation sites (serine or threonine) comprise up to 30% of the G protein amino acid sequence (38). Changes toward or away from asparagine can be associated with a change in the overall glycosylation of the protein and could be associated with adaptive change to local immune responses. The G protein is subject to heavy O-linked glycosylation in the variable regions, with modification frequently on serine or threonine residues in the vicinity of a proline residue (Fig. 5A). Nearly half of the observed changes in the in G protein affect S, T, or P residues (RSVA GA2 37/81 and RSVB 100/241). This is apparent when potential N-and O-linked glycosylation sites are marked on the G protein region (Fig. 5) and is also facilitated by the single nucleotide changes that distinguish codons for these three amino acids.
In the RSVA G proteins, an N237D polymorphism observed in many of the viruses is within the site NTT and would remove a predicted N-linked glycosylation site. Tan et al. (22) also noted that the RSVA-GA2 group showed a frequent change in two predicted glycosylation sites (N237D and S242N). Within the RSVB viruses, 3 of 15 amino acid changes involve asparagine, but none of these changes are predicted N-linked glycosylation sites (Asn-Xaa-Ser/Thr). Changes in N-linked glycosylation areas are known to effect binding of human convalescent-phase sera to peptides (39).
The RSV F protein contains only 10 to 20% of its mass as carbohydrate, and this is attached exclusively via N-glycosidic bonds (37). For the RSVA viruses, 5 of the protein changes are to or away from asparagine. In RSVB, 3 changes involve asparagine; however, none of these are within predicted glycosylation sites. Many polymorphisms were observed in the F protein p27 domain (Fig. 5B). This peptide is likely to serve as a spacer that is freed by cleavage during F maturation and is not found in the mature protein. The large number of changes may simply reflect the disposable nature of this sequence (40).
The NS2 protein may be important in modulating host innate immune responses (41)(42)(43) and may influence movement of infected cells (44). The NS2 showed an elevated level of evolutionary rate (Fig. 4), consistent with a protein interacting with polymorphic host target proteins. Monitoring the local versus global protein changes in NS2 revealed multiple changes occurring in the amino-terminal domain and a portion of the domain important for TRAF3 interactions (43). The majority of changes in the Kilifi RSVA NS2 proteins were also observed in other parts of the world; however, the RSVB NS2 protein showed a significantly high degree of variation only observed in the Kilifi viruses (Fig. 5C).

DISCUSSION
The current work presents a functional approach for community-wide monitoring of RSV whole-genome genetic diversity suitable for detailed transmission studies. A challenge with deep sequencing of large sample sets of RNA viruses is the design of amplification primers. Traditionally, PCR primers were designed using alignments of sequences from the target virus; previous RSV studies with dideoxy sequencing used a greater number of tiled amplicons (2,24,25,38) to cover the whole genome. With larger and more diverse sets, the alignment step becomes problematic. The approach described here bypasses the alignment step and was tailored for deep-sequencing methods. The RSV method uses only 6 amplicons to reduce the amplification costs and the required amount of input RNA. Although two primer sets were designed for RSVA and RSVB, the two sets can be pooled to simplify processing of samples of unknown RSV subtype. The computational methods used for primer selection facilitates updating of the primer sets as additional RSV genome sequence data become available. Frequent updating of these primer sets will help avoid sequence bias that could occur using antiquated primer sets. It is also important that the new full genomes reported here were assembled using de novo assembly methods. Although reference-based methods for assembling genomes from short-read data are rapid and less memory intensive, reference-based methods fail if a close reference genome is not available. The method presented here determines virus genomic sequences directly from patient material and shows sensitivity similar to that of traditional sequencing methods, but it avoids the potential virus selection that may occur if samples are first passaged through cell culture.
The 27 novel Kilifi RSV genomes (11 RSVA and 16 RSVB) generated in this study were used to assess local versus global RSV variety. Similar to the patterns previously observed with G ORF, the full genomic phylogenetic analysis confirmed that Kilifi genomes were interspersed with genomes from other countries, with rapid appearance of variants in Kilifi soon after they are first observed in other parts of the world (45). Kilifi RSV strains are similar to strains that circulate in other regions of the world and reveal only limited local evolution. Phylogenetic clustering appeared to be more influenced by time of virus sample collection than by geographical location, suggesting a fairly rapid global spread of novel RSV variants. It should also be noted that the similarity of the overall topology of phylogenetic trees from whole genomes and G sequences is encouraging and indicates that although fullgenome sequences are most useful for detailed transmission studies, the relationships determined with the G region is similar to the patterns observed with the full-genome sequences.
The availability of full genomes allowed a comparison of estimates of the tMRCA of the Kilifi RSV strains. The obtained tMRCAs were broadly similar, although the higher evolutionary rates of the G region lead, as expected, to slightly later tMRCAs. The estimates based on the entire genomes lead to earlier dates and more discrete confidence intervals than estimates from specific genomic regions. Similar observations were made by Tan et al. (22).
Our comparison of genomes determined to be identical in the G region found nucleotide substitutions elsewhere in their genomes. The genomes with identical G regions invariably were from the same geographical region and over the same epidemic, the sample collection date interval ranging from a few days to months. This observation suggests that nucleotide substitution in the RSV genome in the short term is random, i.e., not concentrated in the regions that appear the most variable in the long term, and supports the use of whole-genome sequencing for monitoring viral transmission chains.
The observed sites of change in the G and F proteins were frequently in exposed regions of the proteins; several involved glycosylation site changes suggestive of immune evasion. In addition, similar to previous reports, the NS2 (Fig. 5B) and M2-2 protein coding regions (not shown) were observed to change at rates higher than that for the full genome. Although these changes could be simply the allowed changes of unconstrained proteins, it is also possible that these sites are important for interacting with the host and may be under some pressure to change. Unfortunately, the sequence data set generated in this study was too small to provide statistically supported evidence of positive selection, but future studies with larger data sets will be facilitated by these methods.
The availability of a collection of RSV genome sequences from a single African location allowed a comparison of local versus global RSV evolution patterns. Important for vaccine design, the RSVA variants observed in a small region of Kenya appear to be in equilibrium with global variants. The same was not observed for RSVB. Possibly, RSVB variants may spread less efficiently, with a higher fraction of variants observed to be specific for Kilifi and not detected in other parts of the world. This pattern is consistent with RSVB as a less transmissible infection than RSVA (4,12). However, there are fewer global sequences available for RSVB, so while the Kilifi RSVB variants appear to be unique, this could be a consequence of less surveillance and documentation of RSVB variation globally. Future work will help clarify this phenomenon, as it