Deep Sequencing of Norovirus Genomes Defines Evolutionary Patterns in an Urban Tropical Setting

ABSTRACT Norovirus is a highly transmissible infectious agent that causes epidemic gastroenteritis in susceptible children and adults. Norovirus infections can be severe and can be initiated from an exceptionally small number of viral particles. Detailed genome sequence data are useful for tracking norovirus transmission and evolution. To address this need, we have developed a whole-genome deep-sequencing method that generates entire genome sequences from small amounts of clinical specimens. This novel approach employs an algorithm for reverse transcription and PCR amplification primer design using all of the publicly available norovirus sequence data. Deep sequencing and de novo assembly were used to generate norovirus genomes from a large set of diarrheal patients attending three hospitals in Ho Chi Minh City, Vietnam, over a 2.5-year period. Positive-selection analysis and direct examination of protein changes in the virus over time identified codons in the regions encoding proteins VP1, p48 (NS1-2), and p22 (NS4) under positive selection and expanded the known targets of norovirus evolutionary pressure.
IMPORTANCE The high transmissibility and rapid evolutionary rate of norovirus, combined with short-lived host immune responses, are thought to be the reasons why the virus causes the majority of pediatric viral diarrhea cases. The evolutionary patterns of this RNA virus have been described in detail for only a portion of the virus genome and never for a virus from a detailed urban tropical setting. We provide a detailed sequence description of the noroviruses circulating in three Ho Chi Minh City hospitals over a 2.5-year period. This study identified patterns of virus change in known sites of host immune response and identified three additional regions of the virus genome under selection that were not previously recognized. In addition, the method described here provides a robust full-genome sequencing platform for community-based virus surveillance.

Norovirus is a nonenveloped, positive-sense, single-stranded RNA virus approximately 7.5 to 7.7 kb in length (reviewed in reference 1). The viral genome is organized into three (or four in the case of murine norovirus [MNV] [2]) open reading frames (ORFs) that encode several structural and nonstructural proteins. ORF1 encodes a large polyprotein that is proteolytically cleaved into six nonstructural proteins, including the N-terminal p48 protein (NS1-2), an NTPase (NS3), the 3A-like p22 protein (NS4), the viral genome-linked VpG protein (NS5), the 3C-like protease 3CLpro (NS6), and the RNA-dependent RNA polymerase RdRp (NS7). Note that the nomenclature for the NS proteins is currently in flux, and both existing names have been included (3). ORF2 overlaps ORF1 by a short region and encodes the major capsid protein VP1, comprising an S (shell) domain connecting the two P (protruding) subdomains, P1 and P2, with the P2 domain binding to histo-blood group antigens (HBGAs) on target host cells. ORF3, located at the 3′ end of the genome, encodes the minor capsid protein VP2.
In humans, norovirus is a highly infectious pathogen that causes a severe gastrointestinal disease in susceptible individuals after the ingestion of an exceptionally small number of viral particles. The virus is so infectious that the probability of symptomatic disease from a single norovirus virion has been estimated to be as high as 0.5 (13). The dose required to infect 50% of test subjects has been estimated to be 1,000 to 3,000 virus genome equivalents (14). A typical norovirus infection can result in profuse volumes of feces and vomitus containing 10⁶ to 10⁹ stable, nonenveloped virions per milliliter of excreta, creating almost infinite opportunities for onward transmission and additional infections. An inability to culture human noroviruses in a laboratory prevents the testing of inactivation and disinfection methods and further complicates control efforts. These issues highlight some of the difficulties in eliminating infectious norovirus from food supplies and the environment and indicate the need for the development of intelligent approaches to prevent norovirus transmission and infection. An effective approach to controlling norovirus may be to understand how norovirus evades the human immune system and use this information to develop novel therapeutic options. Norovirus infection in a "healthy" individual is typically short and self-limiting, which results in transient or short-lived immunity (15,16). No approved drugs that block virus replication exist. Accordingly, public health measures to identify and eliminate sources of infection or behavior leading to virus spread are warranted (17,18). The utility of viral sequencing to track norovirus in transmission studies has been explored with fragments of the viral genome (19)(20)(21)(22).
As a consequence of the speed of disease onset and high transmissibility, the number of nucleotide and amino acid sequence changes within a local outbreak may be small, so the sequencing of larger genomic fragments should provide greater resolution for defining transmission patterns.
The natural duration and specificity of immune responses to norovirus are difficult to measure because of the lack of a cell culture system for norovirus neutralization studies and the inability to grow a defined virus for such trials (reviewed in references 16 and 23). The duration of norovirus immunity may be limited by the short period of a typical infection and a correspondingly short exposure to viral antigens. Periodic population level replacement of norovirus lineages with viruses with surface residues under positive selection is evidence of immune response-driven antigenic change and suggests that these immune responses are of sufficient strength to drive viral evolution (24)(25)(26). Immune studies have identified blockade epitopes in VP1, the major capsid protein.
These epitopes are important for interaction with HBGAs on target host cells; high titers of antibodies that block virus-like particle binding to HBGAs correlate with protection from a norovirus challenge (27)(28)(29).
Diarrheal diseases are a serious health problem, especially in developing countries when combined with nutritional problems, coinfection with other pathogens, crowding, and limited access to health care. It is clear that norovirus and rotavirus are frequently associated with diarrhea in this setting (30), and it is essential to closely follow the local evolution of norovirus. We describe here a method for deep sequencing of the approximately 7,500-nucleotide (nt) norovirus RNA genome directly from patient material and use this method to provide a detailed description of genome- and community-wide norovirus evolution.

MATERIALS AND METHODS
Primer design. Primers were designed by using Python algorithms to identify highly conserved primer targets in the appropriate genome locations. Briefly, the algorithm takes as input all of the complete human norovirus genome sequences available in the GenBank database (January 2012, 260 GII.4 entries, 5 GI entries; total sequence, 1.9 × 10⁶ nt). A counting method was employed to identify all of the highly conserved primer-like sequences with G+C percentages between 30 and 75%, calculated melting temperatures (Tm values) between 55 and 59°C, and no single nucleotide comprising greater than 40% of the sequence. The norovirus genome was divided into three overlapping 2.5- to 3-kb amplicons, and the highest-frequency primer sites in the first and last 800 nt of each amplicon were selected. Finally, the primers were used in a virtual PCR to determine the binding behavior of the primer set with all of the available full norovirus genomes (see Fig. 1). Primer details are summarized in Table 1.
Sample collection. Stool samples were obtained as part of a larger study examining causes of pediatric diarrhea in subjects presenting to Children's Hospital 1, Children's Hospital 2, and the Hospital for Tropical Diseases, Ho Chi Minh City (HCMC), Vietnam (30,31). Additional samples came from an ad hoc enrollment of children admitted to Children's Hospital 2 with potentially hospital-acquired norovirus diarrhea or prolonged norovirus incubation. In the ad hoc collection, pediatric patients were admitted to the hospital because of diseases other than diarrheal diseases and had no diarrhea when they arrived at the hospital. The group included only patients who developed diarrhea after at least 48 h of hospitalization.
Generation of amplified cDNA for deep sequencing. For RNA extraction, 140 µl of each stool specimen was subjected to automated extraction into a final 50-µl elution with the MagNA Pure 96 automated extraction machine according to the manufacturer's instructions (Roche).
Reverse transcription (RT) was performed as previously described (32). Briefly, a primer mixture was prepared separately for each amplicon; the reverse primers for the amplicon were pooled in an equimolar ratio, and water was added to bring the primer mixture to 7 µl (7.6 pmol of each primer; 0.38 pmol/µl per reaction mixture). Extracted norovirus RNA was diluted 1:10 in water; 5 µl of this dilution was added to the primer mixture, which was heated for 5 min at 65°C and immediately transferred to an ice block for 1 min. An enzyme mixture was then added to each reaction mixture and mixed by pipetting. RT was performed at 50°C for 60 min, followed by 70°C for 15 min.
PCR amplification. Amplification was performed with primer mixture solutions prepared for each amplicon. For the primer mixture (per 25-µl reaction mixture), the forward and reverse primers from each amplicon were pooled in a 1.5:1 ratio (1.9 pmol of each forward primer and 1.26 pmol of each reverse primer; 0.08 pmol/µl and 0.05 pmol/µl, respectively). A 5-µl aliquot of the RT reaction mixture for each amplicon was used as the template for the PCR step. The thermal cycling conditions used were enzyme activation at 98°C for 30 s; 35 cycles of 98°C for 10 s, 53°C for 30 s, and 72°C for 3.0 min; a final extension at 72°C for 10 min; and holding at 4°C.
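The primer amounts quoted above can be checked with a few lines of arithmetic. The sketch below is illustrative only: the variable and function names are hypothetical, and the 20-µl RT reaction volume is not stated in the text but is inferred from the quoted 7.6 pmol and 0.38 pmol/µl figures.

```python
def per_microliter(pmol, reaction_volume_ul):
    """Final concentration (pmol/µl) of one primer in a reaction."""
    return pmol / reaction_volume_ul

# Values stated in the text for the PCR step.
PCR_VOLUME_UL = 25.0
FORWARD_PMOL = 1.9    # per forward primer
REVERSE_PMOL = 1.26   # per reverse primer

forward_conc = per_microliter(FORWARD_PMOL, PCR_VOLUME_UL)
reverse_conc = per_microliter(REVERSE_PMOL, PCR_VOLUME_UL)
forward_to_reverse = FORWARD_PMOL / REVERSE_PMOL

# RT step: the reaction volume is an inference, not a stated value.
implied_rt_volume_ul = 7.6 / 0.38

print(f"forward: {forward_conc:.2f} pmol/µl, reverse: {reverse_conc:.2f} pmol/µl")
print(f"forward:reverse ratio = {forward_to_reverse:.1f}:1")
print(f"implied RT reaction volume = {implied_rt_volume_ul:.0f} µl")
```

The rounded values agree with the concentrations and the 1.5:1 ratio given in the text.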
Sequencing and genome assembly. Pooled amplicons for each sample (approximately 1.2 µg) were individually indexed and subjected to sequencing with Illumina MiSeq (33,34) to generate approximately 300,000 reads of 149 nt per sample (median value, 302,904 reads). All reads were processed with QUASR (35) to remove sequencing adapters and index sequences and to trim primer sequences present within a fixed distance of the 5′ or 3′ end of a read. Reads were then trimmed from the 3′ end to reach a minimum median Phred quality score of 35, and reads <125 nt in length were removed. After primer trimming and quality control for each sample, de novo assembly with SPAdes (36) was used to generate full norovirus genomes. Intact ORFs were checked with Python scripts as a measure of correct genome assembly.
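The quality-control rule described above (3′ trimming to a minimum median Phred score of 35, then removal of reads shorter than 125 nt) can be sketched as follows. This is a simplified illustration and not the QUASR implementation; the function name and the one-base-at-a-time trimming loop are assumptions.

```python
from statistics import median

def trim_to_median_quality(seq, quals, min_median=35, min_length=125):
    """Trim bases from the 3' end until the median Phred score of the
    remaining read is at least min_median.

    Returns (trimmed_seq, trimmed_quals), or None if the read falls below
    min_length before the quality target is reached.
    """
    end = len(seq)
    # Remove one 3' base at a time while the read is still long enough
    # and its median quality is below the threshold.
    while end >= min_length and end > 0 and median(quals[:end]) < min_median:
        end -= 1
    if end < min_length:
        return None
    return seq[:end], quals[:end]

# Example: a read whose low-quality 3' tail drags the median below 35.
read = "A" * 270
scores = [40] * 130 + [10] * 140
trimmed = trim_to_median_quality(read, scores)
```

Because the criterion is a median, a short low-quality tail on an otherwise good read is left untouched; trimming starts only when low-quality bases make up about half of the read.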
Recombination detection. The 119 complete genomes of all of the GII noroviruses from this study and from global data (retrieved from the GenBank database) were manually aligned with Se-AL v2.0 (http://tree.bio.ed.ac.uk/software/seal/). Only full-length sequences with information on the sample collection date and location were included in this analysis. The potential presence of recombination in these complete sequences was screened for with the Recombination Detection Program version 4 (RDP4) software (37). The RDP, GENECONV, 3SEQ, and MAXCHI methods were employed for primary screening, and the BOOTSCAN and SISCAN methods were used for automatic checking of the recombination signals, as described previously (38). The automask X function in RDP4 was selected for optimal recombination detection; i.e., one representative strain within each group of similar sequences was examined during the primary/exploratory search for recombination signals while the remaining sequences within groups of sequences with high similarity were automatically masked. By this method, masked sequences were examined for the presence of recombination if the program detected a recombination signal in the representative unmasked sequence. Each test of recombination used a 400-nt sliding window, and any recombination signals with significant P values for three or more test parameters were considered potential recombination events. A further analysis of these potential recombinants, comparing tree topologies with likelihood (Shimodaira-Hasegawa test), was employed to determine which of the test strains were likely to be true recombinants and which were not. All intra-ORF recombinant strains (GenBank accession numbers EU921388, AB541275, GU991355, and AB541254) were excluded from the estimation of positive selection and evolutionary rates.
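The "three or more methods" rule for flagging a potential recombination event can be expressed as a small filter. The sketch below does not call RDP4; it assumes that per-method P values have already been exported into a dictionary, and the significance cutoff of 0.05 is an assumption, since the text does not state the threshold used.

```python
def potential_recombinant(p_values, alpha=0.05, min_methods=3):
    """Return (is_candidate, supporting_methods) for one recombination signal.

    p_values maps a method name to its P value, or to None when the
    method reported no signal. alpha is an assumed cutoff.
    """
    supporting = sorted(
        method for method, p in p_values.items()
        if p is not None and p < alpha
    )
    return len(supporting) >= min_methods, supporting

# Hypothetical signal, for illustration only.
signal = {
    "RDP": 0.001, "GENECONV": 0.2, "3SEQ": 0.01,
    "MAXCHI": 0.03, "BOOTSCAN": None, "SISCAN": 0.5,
}
is_candidate, methods = potential_recombinant(signal)
```

Signals that pass this filter would still require the tree topology comparison described above before being treated as true recombinants.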
Phylogenetic analysis. An alignment of nonrecombinant sequences including all of the full genomes determined in this analysis and global background sequences obtained from the GenBank database was utilized to reconstruct evolutionary relationships among norovirus sequences. A phylogenetic tree was inferred by using aligned nucleotide sequences, employing a maximum-likelihood (ML) method in RAxML (39) under the GTR+Γ model of substitution, which was determined to be the model that fit our data best with jModelTest version 2.1.1 (40). Tree topology was assessed through bootstrapping with 1,000 pseudoreplicates. The resulting phylogenetic tree was visualized and edited in FigTree v1.4.0 (http://tree.bio.ed.ac.uk/software/figtree/).
Evolutionary-rate estimations. Evolutionary rates were estimated by a Bayesian Markov chain Monte Carlo (BMCMC) method implemented in BEAST version 1.7.2 (41). A relaxed uncorrelated lognormal molecular clock was employed to account for lineage-specific rates, and a GMRF Bayesian skyride coalescent (42) was used to model the population dynamics. The relevant substitution models for each alignment were selected with jModelTest version 2.1.1 (40). The mean evolutionary rate and the 95% upper and lower highest posterior density (HPD) intervals were inferred from the posterior tree distribution generated from the BMCMC runs with Tracer version 1.6 (http://tree.bio.ed.ac.uk/software/tracer/).
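For readers unfamiliar with HPD intervals, the summary step can be illustrated with a short sketch: given posterior samples of a rate, the 95% HPD interval is taken as the narrowest interval that contains 95% of the samples. This is a generic sample-based estimate and is not claimed to reproduce Tracer's implementation; the function name and the example values are hypothetical.

```python
import math

def hpd_interval(samples, mass=0.95):
    """Narrowest interval containing at least `mass` of the samples."""
    xs = sorted(samples)
    n = len(xs)
    k = max(1, math.ceil(mass * n))  # number of samples the interval must hold
    # Slide a window of k consecutive sorted samples; keep the narrowest.
    start = min(range(n - k + 1), key=lambda i: xs[i + k - 1] - xs[i])
    return xs[start], xs[start + k - 1]

# Hypothetical posterior draws, for illustration only.
draws = [0.0, 1.0, 1.1, 1.2, 5.0, 9.0]
mean_rate = sum(draws) / len(draws)
interval = hpd_interval(draws, mass=0.5)
```

In practice, the draws would be the post-burn-in rate values logged by the BMCMC run.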
Positive-selection analysis. To determine evolutionary patterns of norovirus, selection analyses of the regions encoding VP1, VP2, and the ORF1-encoded p48 (NS1-2) and p22 (NS4) proteins were performed. Norovirus codons under selective pressure were first determined with the mixed-effects model of evolution (MEME; P value, <0.05) (43) and fast unconstrained Bayesian approximation (FUBAR; posterior probability, >0.9) (44) implemented through the Datamonkey web server (45). Codons that were found to be under positive selection by either method were inspected in the sequence alignment, and those with no evidence of polymorphisms were considered false positives and discarded.
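The two-stage filter described above (flag a codon if either MEME or FUBAR supports it, then discard sites that are not polymorphic in the alignment) can be sketched as follows. The sketch assumes that per-codon MEME P values and FUBAR posterior probabilities have already been exported into dictionaries and that polymorphism is checked on an amino acid alignment; the function names and data layout are hypothetical.

```python
def candidate_codons(meme_p, fubar_pp, p_cut=0.05, pp_cut=0.9):
    """Codons (1-based) flagged by MEME (P < p_cut) or FUBAR (PP > pp_cut)."""
    flagged = {codon for codon, p in meme_p.items() if p < p_cut}
    flagged |= {codon for codon, pp in fubar_pp.items() if pp > pp_cut}
    return sorted(flagged)

def is_polymorphic(alignment, codon):
    """True if more than one residue occurs at this 1-based position.
    Gaps ('-') and unknown residues ('X') are ignored."""
    residues = {seq[codon - 1] for seq in alignment} - {"-", "X"}
    return len(residues) > 1

def filter_calls(meme_p, fubar_pp, alignment):
    """Keep only flagged codons that vary in the alignment."""
    return [c for c in candidate_codons(meme_p, fubar_pp)
            if is_polymorphic(alignment, c)]

# Toy data, for illustration only.
meme = {1: 0.01, 2: 0.20, 3: 0.04}
fubar = {1: 0.95, 2: 0.95, 3: 0.50, 4: 0.20}
proteins = ["MKTA", "MRTA", "MKTA"]
kept = filter_calls(meme, fubar, proteins)
```

In the toy data, codons 1, 2, and 3 are flagged by at least one method, but only codon 2 varies in the alignment, so it is the only call retained.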
Ancestral sequences were reconstructed from the sequence alignment and inferred phylogeny by the joint-likelihood method implemented in HyPhy (46) under a GTR+Γ model of evolution.
Nucleotide sequence accession numbers. The GenBank accession numbers of all of the new norovirus sequences reported here are listed in Table 2. Also listed are the sample collection dates, the genetic clusters (see Fig. 2

Norovirus sequencing strategy.
A novel general strategy for designing PCR primers was developed that would permit the production of complete norovirus genome sequences. Deep sequencing of RNA virus genomes requires RT of viral RNA and amplification of the resulting cDNA, which encompasses the entire viral genome. Python algorithms were used to process all of the available norovirus full-genome data (265 full genomes, January 2012) and to select primer target sequences suitable for whole-genome amplification. Briefly, the algorithm processes the norovirus sequence data into primer-sized sequences trimmed to a calculated Tm. The frequency of each sequence in the entire set is calculated, with high-frequency sequences correlating with conserved sites across the viral genome. The norovirus genome was divided into three overlapping amplicons, potential primers were mapped to a reference genome, and the highest-frequency sequences mapping within the terminal 800 nt of each amplicon were identified. Reverse complements of the primers mapping to the 3′ end were prepared. A virtual PCR was performed to examine the potential function of the primers across all known full norovirus genomes. The output of such an analysis is shown in the left panel of Fig. 1, with blue markers indicating the position of each primer and gray bars indicating the expected PCR product. The actual function of the primer set is demonstrated in the right panel of Fig. 1, with each lane showing the PCR products from 14 samples, presented by amplicon. Each RT reaction mixture contained two (or three for amplicon 3) reverse primers each for amplicon 1, 2, or 3, and each PCR mixture contained two (or three for amplicon 3) forward and reverse primers for amplicon 1, 2, or 3. Of these samples, no. 7 failed; however, the remaining 13 samples provided sufficient material for deep sequencing. A summary of the predicted performance of the norovirus primer set with all of the available norovirus genomes is shown in Table 1.
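The alignment-free counting step can be illustrated with a minimal Python sketch. It is not the authors' code: the function names are hypothetical, candidates are fixed-length 20-mers rather than sequences trimmed to a calculated Tm, and Tm is approximated with the simple 2°C per A/T plus 4°C per G/C rule, which is an assumption. The G+C range, Tm window, and 40% single-nucleotide limit follow the values given in Materials and Methods.

```python
from collections import Counter

def gc_fraction(seq):
    return (seq.count("G") + seq.count("C")) / len(seq)

def rough_tm(seq):
    """Approximate Tm in deg C: 2 per A/T plus 4 per G/C (assumed model)."""
    return 2 * (seq.count("A") + seq.count("T")) + 4 * (seq.count("G") + seq.count("C"))

def is_primer_like(seq, gc_range=(0.30, 0.75), tm_range=(55, 59),
                   max_base_fraction=0.40):
    """Apply the G+C, Tm, and base-composition filters to one candidate."""
    if set(seq) - set("ACGT"):          # skip ambiguous bases
        return False
    if not gc_range[0] <= gc_fraction(seq) <= gc_range[1]:
        return False
    if not tm_range[0] <= rough_tm(seq) <= tm_range[1]:
        return False
    return max(seq.count(b) for b in "ACGT") / len(seq) <= max_base_fraction

def count_candidates(genomes, k=20):
    """Count the number of genomes in which each primer-like k-mer occurs."""
    counts = Counter()
    for genome in genomes:
        kmers = {genome[i:i + k] for i in range(len(genome) - k + 1)}
        counts.update(s for s in kmers if is_primer_like(s))
    return counts

# Toy input, for illustration only.
site = "ACGTACGTACGTACGTATAT"
toy_counts = count_candidates([site, site, "A" * 20])
best_site, n_genomes = toy_counts.most_common(1)[0]
```

Because candidates are counted directly from unaligned genomes, the input set can be extended with new sequences without recomputing a multiple alignment. Selecting the highest-frequency candidates within the terminal 800 nt of each amplicon would additionally require mapping candidates to a reference genome, which is omitted here.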
All full-length norovirus GII genomes (taxonomic identification no. 142786; length, 7,000 to 8,000 nt; 517 entries) or all norovirus genomes (taxonomic identification no. 122929; length, 7,000 to 8,000 nt; 753 entries) were retrieved from the GenBank database. These genome sets were examined for the target sequence for each primer, and the percentage of genomes with a perfect match to the target sequence or with a functional match (zero to three mismatches) to the target sequence was reported. For the norovirus GII genomes, the primers have a perfect match to 79% of the genomes and a functional match (zero to three mismatches) to 97% of the genomes. For the complete set of norovirus genomes (this includes all GI, all GII, and all animal noroviruses), the primers have a perfect match to 65% of the genomes and a functional match (zero to three mismatches) to 82% of the genomes. These values and the details of the analysis, as well as the GC contents and calculated Tm values for all of the primers, are listed in Table 1.
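The perfect-match and functional-match percentages can be computed with a simple scan of each genome for the best-matching primer site. The sketch below is a simplified stand-in for the virtual PCR analysis: it considers substitutions only (no insertions or deletions), searches one strand (reverse primers would be reverse-complemented beforehand), and uses hypothetical function names and toy sequences.

```python
def min_mismatches(primer, genome):
    """Smallest number of mismatches between the primer and any
    same-length window of the genome."""
    k = len(primer)
    best = k
    for i in range(len(genome) - k + 1):
        mismatches = sum(a != b for a, b in zip(primer, genome[i:i + k]))
        best = min(best, mismatches)
        if best == 0:
            break
    return best

def match_summary(primer, genomes, max_mismatches=3):
    """Percentage of genomes with a perfect match (0 mismatches) and with
    a functional match (0 to max_mismatches mismatches)."""
    distances = [min_mismatches(primer, g) for g in genomes]
    n = len(distances)
    return {
        "perfect_pct": 100 * sum(d == 0 for d in distances) / n,
        "functional_pct": 100 * sum(d <= max_mismatches for d in distances) / n,
    }

# Toy data, for illustration only.
toy_primer = "ACGTACGT"
toy_genomes = ["TTACGTACGTTT", "TTACGAACGTTT", "GGGGGGGGGGGG", "TTTTTTTT"]
summary = match_summary(toy_primer, toy_genomes)
```

Note that a functional match as defined here includes perfect matches, so the functional percentage is never lower than the perfect percentage.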
A summary of the performance of the norovirus primer set for amplifying and sequencing 188 fecal sample-derived RNAs is presented in Table 3. PCR success was defined as obtaining the three amplicon-specific RT-PCR products of the predicted size with sufficient yield for sequencing library preparation. The overall RT-PCR success rate was 78.2% (147 of the 188 clinical samples tested). The most common genotype globally, GII.4, had the highest PCR success rate (93.7%, 74 of 79 samples), followed by GII.6 (88%, 7 of 8 samples), GII.13 (83%, 5 of 6 samples), and GII.3 (77%, 26 of 34 samples). Much lower amplification efficiency was observed for GI strains, with successful PCR genome amplification in only 2 of 10 samples tested. The higher success with GII strains (especially GII.4) than with GI strains was predictable given that GII.4 genomes dominate the sequences in public databases. Future primer sets could be iteratively redesigned by using targeted and revised genome data sets.
Norovirus diversity in HCMC. By using the whole-genome sequencing technique developed, 112 novel GII norovirus genomic sequences were generated. In addition, 89 GII.4 genomes from the same HCMC study were publicly available in the GenBank database; these were included in the following analysis for a total of 201 complete genomes with collection dates between April 2009 and December 2011. A phylogenetic analysis of the 201 genomes defined eight genotypes of GII norovirus by ML methods (Fig. 2). Consistent with previous characterization of norovirus (Fig. 2). Our genotype assignment based on phylogenetic reconstruction was consistent with the genotype designation generated by the RIVM algorithm (47) (Table 4).
Evolutionary rates within each cluster. A sufficient number of genomes were available from clusters 1, 4, and 5 for well-supported evolutionary-rate estimations (Table 5). Mean evolutionary rates of 6.15 × 10⁻³, 5.73 × 10⁻³, and 5.34 × 10⁻³ substitutions per site per year were estimated from the full genomes of clusters 1, 4, and 5, respectively. Figure 4 plots the rates for the GII.4 cluster 1 viruses by the region of the genome used for each calculation.
The ORF-specific rates estimated for the three genetic clusters show that the ORF1 regions exhibited lower rates than the ORF2 (VP1) regions. For all three clusters, the ORF1 and ORF2 (VP1) regions showed rates modestly lower than that of the full genome, while the ORF3 (VP2) substitution rates of both cluster 1 (8.99 × 10⁻³ substitutions per site per year) and cluster 5 viruses (7.38 × 10⁻³ substitutions per site per year) were higher than that of the whole genome. The overlapping confidence intervals for these estimations make these conclusions less secure. The amount of signal available for cluster 4 ORF3 was not sufficient to yield a reliable rate estimate.
FIG 3 Temporal appearance of the HCMC norovirus GII genotypes during the study period. Genomes were stratified by genotype (from Fig. 2), color coded, and plotted by date of sample isolation.
Norovirus ORF1 encodes a large polyprotein containing the viral polymerase and protease and several essential replicase components. Evolutionary rates were estimated separately for these individual coding regions of cluster 1 ORF1 (Table 5; Fig. 4). The region encoding p22 (NS4) showed the highest levels of change (6.60 × 10⁻³ and 8.21 × 10⁻³ substitutions per site per year, Fig. 4), greater than the whole-genome rates for cluster 1 (6.15 × 10⁻³ substitutions per site per year). The enzymes (NTPase [NS3], protease, and RdRp [NS7]) and VP1 show substitution rates modestly lower than those observed across the whole genome.
Amino acid changes in norovirus proteins. The evolutionary patterns of four norovirus-encoded proteins with the higher evolutionary rates were examined (VP1, VP2, p48 [NS1-2], and p22 [NS4]). An alignment of protein sequences ordered by time was used to detect sustained versus sporadic changes in the protein relative to a reconstructed ancestral sequence. Information about the biochemical properties of the protein was gathered from the published literature. Positive-selection analysis was performed with MEME (43) or FUBAR (44).
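The comparison of each protein with the reconstructed ancestral sequence can be sketched in a few lines. The example assumes gap-free positions in a fixed-length protein alignment and uses the conventional notation of ancestral residue, 1-based position, and derived residue (for example, Q106R); the function names and toy sequences are hypothetical.

```python
from collections import Counter

def substitutions(ancestor, protein):
    """List changes such as 'K2R' between two aligned, equal-length
    protein sequences. Positions with a gap in either sequence are skipped."""
    return [
        f"{a}{pos}{b}"
        for pos, (a, b) in enumerate(zip(ancestor, protein), start=1)
        if a != b and "-" not in (a, b)
    ]

def substitution_counts(ancestor, dated_proteins):
    """Count how many samples carry each substitution.
    dated_proteins is a list of (collection_date, sequence) pairs."""
    counts = Counter()
    for _, seq in sorted(dated_proteins):   # ordered by collection date
        counts.update(substitutions(ancestor, seq))
    return counts

# Toy data, for illustration only.
ancestral = "MKTAY"
samples = [("2010-06", "MRTAH"), ("2009-05", "MKTAY"), ("2010-01", "MRTAY")]
changes = substitution_counts(ancestral, samples)
```

A change seen in many samples over a sustained period (K2R in the toy data) would be treated differently from one seen in a single sample (Y5H), which is more likely to be sporadic.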
Cluster 1 VP1 showed changes in multiple patients relative to the ancestral sequence, i.e., Q106R, S174P, and N298D in blockade epitope A and G340E and G393S in blockade epitope D (Fig. 5). Additional substitutions were seen at a lower frequency, suggesting evolution during the course of transmission through HCMC. Position 298 in blockade epitope A was found to be positively selected with FUBAR, while both FUBAR and MEME identified position 106 within the shell domain (Fig. 5) as being under positive selection (Table 6).
An alignment of VP2 protein sequences ordered by time was used to detect sustained versus sporadic changes in the protein relative to the ancestral sequence. Several changes, including T139M/A, I144V/T, and Y169H, occurred in multiple HCMC cluster 1 viruses with a much higher frequency of changes in the internal region of the protein (Fig. 6). It was previously noted that changes in this region of VP2 (VP1-interacting domain [VP_ID]) were associated with changes in VP1 (48). Both MEME and FUBAR identified VP2 codon 144 (marked with a red asterisk in Fig. 6) as being under positive selection.
The region encoding p22 (NS4) from the cluster 1 viruses showed higher evolutionary rates than the full genome (Table 5; Fig. 4). Analysis of all of the encoded p22 (NS4) molecules from cluster 1 (Fig. 7) showed amino acid differences from the ancestral sequence. Substitutions were observed in multiple isolates, suggesting neutral or positive selection (I29V, E46D, N77S, R82K, T86S, and D174V). Analysis of all of the encoded p48 (NS1-2) molecules from cluster 1 (Fig. 8) showed amino acid changes appearing in multiple isolates, suggesting neutral consequences with no constraints to limit change or positive selection (D7V, N15D, R55K, V79T [or V79A], and S184P). Both MEME and FUBAR identified p48 (NS1-2) codon 79 as being under positive selection (Table 6).
[Displaced figure caption fragment, apparently from FIG 4: mean values are indicated by colored circles, and error bars show 95% confidence intervals. The region of the norovirus genome used for calculation is labeled, and the two regions with rates higher than that of the full genome are in red.]

DISCUSSION
Our work outlines a strategy for full-genome deep sequencing of norovirus directly from fecal specimens, and we have applied the strategy to characterize norovirus samples collected across a clinical spectrum of pediatric norovirus infections in HCMC, Vietnam. An essential component of the methods is a primer design algorithm that takes as input all of the available sequence data for a virus and quickly provides a set of functional primers. The flexible design of the primer design algorithm avoids a cumbersome alignment step in the process and facilitates regular updates with new sequence data. This is essential to avoid perpetuating a bias in the sequence data whereby sequences are obtained only if primers have functioned and primers are designed on antiquated data sets. The method showed a high success rate of full-genome sequencing of GII noroviruses, especially GII.4, which was predictable given that GII.4 genomes dominated the sequence data set used to design the primers. Future primer sets will be designed by using more targeted and updated genome sets and including more sequence data from other genogroups. Results obtained by this method have provided a large set of norovirus genome sequences derived from longitudinal samples from one location. At the start of this study, 265 full norovirus genomes were available in the GenBank database; this study added an additional 112 genomes. The data allowed the estimation of evolutionary rates for several genotypes, for full genomes, as well as for subgenomic regions. The evolutionary pressures and the constraints to avoid change are not expected to be uniform across the virus genome. Selection pressures are likely to vary greatly, depending on the function of the encoded proteins, with enzymatic and structural regions more constrained than surface- and immune-exposed or spacer regions with less-well-defined functions.
The ORF-specific substitution rates estimated for the three phylogenetic clusters show that the ORF1 regions exhibited evolutionary rates lower than those of the ORF2 (VP1) regions.
Previous studies have estimated that norovirus GII.4 and GII.3 VP1 capsid regions evolve at 5.1 × 10⁻³ to 5.8 × 10⁻³ substitutions per site per year (49)(50)(51), while it was estimated that the GII.4 polymerase region evolves at 4.33 × 10⁻³ to 8.98 × 10⁻³ substitutions per site per year, depending on the data set used (49). Our estimates based on HCMC data are consistent with these previously published values. The evolutionary rate determined for GII.4 cluster 1 was higher than the estimated rates for GII.4 cluster 4 and the GII.3 cluster 5 viruses, perhaps because of a greater number of cluster 1 infections per unit of time and thus a greater number of replication events. Alternatively, the three virus genotypes might have intrinsically different replication properties, polymerase fidelity, or immune selection pressure that result in the differing rates.
The norovirus sequence data obtained from this study allowed an analysis of the evolutionary patterns of the second viral capsid protein VP2. The high evolutionary rates reported here (cluster 5, 7.38 × 10⁻³ substitutions per site per year; cluster 1, 8.99 × 10⁻³ substitutions per site per year) have not been observed previously, as this region was seldom included in previous sequencing projects. The structure of VP2 is not defined, although there is evidence that the protein is interior to the VP1 shell and may be important for assembly of the VP1 structure (52). The protein is moderately basic, and the C-terminal half of the protein is rich in serine and threonine residues (providing possible phosphorylation sites) and proline residues (perhaps accounting for the inability to define the structure of this protein). Evidence that changes in VP2 accompany changes in VP1 has been presented (48). Recently, MNV VP2 has been shown to influence the host immune response to the virus, with MNV1 VP2 interfering with antigen-presenting cell function and MNV3 VP2 promoting the response (53). These observations identify a possible site of virus-host interaction that could be a source of selective pressure. The evolutionary rates of the VP2-encoding regions were found here to be much higher than those of the well-studied norovirus VP1 region, and the higher rates are consistent with a less constrained protein product, stronger selection pressures, or both. Positive-selection analysis across the VP2 region identified position 144 as being under selection; this region of the protein was previously found to be involved in interactions of VP2 with VP1 (52). A high evolutionary rate in a virus capsid protein suggests a region of the virion experiencing immune selection. Vaccine development efforts should take this accelerated rate of change into consideration when selecting components for a vaccine.
Humoral immunity to norovirus (at least GII.4) may involve blockade antibodies that bind and block the VP1 residues required for binding to HBGAs (16,54,55). The correlation of high-titer blockade antibodies with protection from gastroenteritis in challenge studies (29) and the frequent evolution of these sites (blockade epitopes A, D, and E) suggest that these amino acid residues may be frequent targets of immune selection (55). Blockade epitope D may be directly involved in HBGA binding (16,54). Our observation of changes in VP1 position 298 epitope A, position 393 epitope D, and position 412 epitope E supports these previous conclusions. Several additional changes were located outside the blockade epitopes (S78G, S174P, G340E, and T502N, Fig. 5). Further studies should investigate whether these are founder effect changes of neutral consequence or if they provide an advantage for the virus.
Similar mean evolutionary rates for full genomes were found in clusters 1, 4, and 5, with 95% confidence interval ranges largely overlapping. One might expect a higher evolutionary rate for GII.4 viruses than for GII.3 viruses if the 10-fold higher detection frequency of GII.4 viruses than of GII.3 viruses directly reflects the community prevalence of these two infections. The similar full-genome rates suggest either that the number of active infections is not a large factor in the rate or that the less frequently diagnosed GII.3 infection is as frequent in the population as GII.4 but does not appear as frequently in clinics.
The ORF1-encoded p22 (NS4) regions showed a higher evolutionary rate than the full genome, and p48 (NS1-2) codon 79 was found to be under positive selection. The function of p22 (NS4) is not known, but the protein has been observed to localize to the Golgi compartment/endoplasmic reticulum (ER) and influence the host secretory pathway with a centrally located MERES (mimic of an ER export signal) motif required for localization (56,57). The function of p48 (NS1-2) in norovirus infection is also largely unexplored, although the protein is reported to localize to vesicles and has been proposed to influence protein trafficking (58). The evidence that these viral proteins interact with host proteins, combined with the higher evolutionary rate or positive selection described here, suggests that these proteins may interact with host restriction factors. Alternatively, these regions with higher rates of change could encode proteins with no constraint. Further studies are needed to clarify this.
Extensive work has been done with the feline calicivirus and MNV models to elucidate the roles and interactions of the nonstructural (NS1-7) and structural (VP1 and VP2) proteins in the regulation of virus replication and infectivity, as comprehensively reviewed in reference 1. However, functional profiling of human norovirus is not yet possible because of the lack of tissue culture and animal models for human norovirus replication. The full-genome sequences of human norovirus available from this study provide valuable data on the spectrum of changes in the viral proteins allowed by the virus while awaiting alternative models for functional experiments.
This study has provided a description of norovirus evolution rates across HCMC over a 2.5-year period for the full genome, as well as for subgenomic regions, of the virus. We reveal for the first time a higher evolutionary rate in three regions of the genome (VP2, p22 [NS4], and p48 [NS1-2]) and provide evidence of positive selection in two coding regions (VP2 and p48 [NS1-2]). We suggest that these regions should be monitored for interactions with the host that might be a source of selective pressure. Finally, we believe that this study and the methods we have described will provide a useful template for community-wide studies of the full-genome evolution of many RNA virus pathogens.