ABSTRACT
Intrahost and interhost assessments of viral diversity are often treated as measures of separate and distinct evolutionary processes, with numerous investigations reporting seemingly incompatible results between the two. For example, in human cytomegalovirus, the nucleotide diversity estimates are 10-fold higher for interhost data, while the number of segregating (i.e., polymorphic) sites is 6-fold lower. These results have been interpreted as demonstrating that sampled intrahost variants are strongly deleterious. In reality, however, these observations are fully consistent with standard population genetic expectations. Here, we analyze published intra- and interhost data sets within this framework, utilizing statistical inference tools to quantify the fitness effects of segregating mutations. Further, we utilize population level simulations to clarify expectations under common evolutionary models. Contrary to common claims in the literature, these results suggest that most observed polymorphisms are likely nearly neutral with regard to fitness and that standard population genetic models in fact well predict observed levels of both intra- and interhost variability.
IMPORTANCE With the increasing number of evolutionary virology studies examining both intrahost and interhost patterns of genomic variation, a number of seemingly incompatible results have emerged, revolving around the far greater level of observed intrahost than interhost variation. This has led many authors to suggest that the great majority of sampled within-host polymorphisms are strongly deleterious. Here, we demonstrate that there is in fact no incompatibility of these results and, indeed, that the vast majority of sampled within-host variation is likely neutral. These results thus represent a major shift in the current view of observed viral variation.
INTRODUCTION
The majority of evolutionary studies in the virus literature are divided into those examining intrahost data and those examining interhost data. Broadly speaking, intrahost studies deeply sample viral diversity from a limited number of biological samples (e.g., plasma, urine, or saliva), while interhost studies sample viral diversity from a large number of hosts, but with a shallow perspective of variation within each host. These studies are often treated as separate analyses of viral diversity. For example, studies of intrahost human cytomegalovirus (HCMV) populations report per-site nucleotide diversity values of ∼0.002, with approximately 160,000 segregating (i.e., polymorphic) sites in the ∼235,000-bp genome, while analyses of interhost data report per-site nucleotide diversity estimates of ∼0.02 and approximately 28,000 segregating sites (Table 1). This result has been interpreted to suggest that the great majority of polymorphisms sampled in intrahost data are strongly deleterious mutations, possibly carried on nonreplicative genomes (1). This heavy deleterious load in intrahost data has been taken as a lack of evolutionary stasis, with interhost diversity patterns interpreted as strong constraints on long-term evolution (2).
HCMV intrahost and interhost population summary statistics
However, under the commonly invoked neutral and nearly neutral models of molecular evolution (3, 4), many newly arising mutations will be strongly deleterious and will be eliminated by purifying selection, thus never reaching the appreciable frequency necessary to be sampled in population (i.e., intrahost) data. The remaining variation that can be sampled is predicted to have little if any impact on fitness and is thus governed by genetic drift. As such, most mutations will remain relatively rare in the population. In particular, under this equilibrium neutral model, half as many doubletons (i.e., mutations segregating in two copies) as singletons (i.e., mutations segregating as one copy) are expected, a third as many tripletons as singletons, and so on. This summary of variation is known as the site frequency spectrum (SFS). Indeed, across a wide array of organisms for which population data have been sampled, the majority of polymorphisms are present at low frequencies, producing a roughly L-shaped SFS, consistent with the model's expectation.
One may summarize and compare such intrahost population variation using a variety of statistics. They commonly include the total number of segregating sites, the SFS or the multiple statistics designed to summarize the SFS (e.g., Tajima's D [5] and Fay and Wu's H [6]), and the level of genetic differentiation between samples (e.g., FST). In the last case, particularly when making interhost comparisons, the common practice in virology involves summarizing each intrahost population with a consensus sequence. These multiple consensus sequences are then compared in order to estimate overall interhost diversity. As a natural consequence of the expected SFS, this consensus sequence represents only high-frequency alleles of a single intrahost population, making the interhost comparison a strongly ascertained sample of variation (Fig. 1). Here, we demonstrate that there is in fact no incompatibility in the observed results; standard population genetic models inferred from intrahost data well predict interhost observations. We also demonstrate that, indeed, most sampled variation is likely nearly neutral.
Cartoon depiction of intrahost and interhost data. On the left are cartoons showing two viral populations with low-frequency (blue) and high-frequency (red) derived polymorphisms. As shown, while intrahost populations are expected to harbor high diversity, the consensus sequence used for interhost comparisons includes only the small fraction of mutations segregating above 50% frequency.
RESULTS AND DISCUSSION
Population models of relevance.As seen in Fig. 1, the relative ratio between intrahost and interhost polymorphisms depends on the frequency distribution of polymorphisms within the intrahost populations. This site frequency spectrum is sensitive to a variety of factors, including the demographic history of the population, which in viruses is likely characterized by severe population size reductions associated with infection, followed by rapid population growth (i.e., population bottlenecks [7]). The frequencies of observed polymorphisms are also shaped by the effects of direct selection, in which the frequencies of mutations directly impacting fitness may be increased or decreased depending upon the beneficial or deleterious effect of the mutation. Such selective effects of segregating mutations are summarized by the distribution of fitness effects (DFE). In addition, observed mutational frequencies are shaped by linked selection, in which the frequencies of polymorphisms linked to directly selected sites may also be altered. Such linkage effects are described by the model of genetic hitchhiking in the case of linkage to a beneficial mutation (8), or by the model of background selection in the case of linkage to a deleterious mutation (9), and are naturally modulated by the extent of recombination between the sites in question.
All of these factors, except the distribution of fitness effects, have been studied and inferred in HCMV intrahost populations (10–12). The DFE characterizing HCMV intrahost populations is inferred here using two methods. First, the method of Eyre-Walker and Keightley (13) was employed to estimate the DFE of segregating mutations. This approach relies upon comparing the frequency of putatively neutral (e.g., noncoding) alleles to that of putatively selected (e.g., nonsynonymous) alleles as a means of estimating the fitness effects of selected sites. Frequency values for intrahost HCMV populations were obtained from previously published data (14). The fitness effects, expressed as |Ne × s|, where Ne is the effective population size and s is the selection coefficient, are binned into four classes: 0 to 1, neutral; 1 to 10, nearly neutral; 10 to 100, governed by selection; and >100, strongly governed by selection (15, 16). With this analysis, we estimate that the vast majority (97%) of segregating nonsynonymous alleles of HCMV intrahost populations are in the neutral class (Fig. 2A), with an average selection coefficient of −0.0024.
Estimates of the distribution of fitness effects of segregating alleles from intrahost data. (A) Absolute-value estimates of Nes from the Eyre-Walker and Keightley approach for inferring fitness effects of segregating nonsynonymous alleles across four patients. (B) WFABC estimates of fitness effects of segregating alleles in the same four longitudinal patient samples: 1254, 1587, MS1, and MS2 (11, 14). As shown, both approaches suggest that the majority of observed segregating mutations are neutral or nearly neutral.
In a complementary approach, we also estimated the fitness effects of all segregating alleles in longitudinally sampled populations using the Wright-Fisher ABC (WFABC) approach (17). Here, Ne and the locus-specific s for each segregating site are jointly estimated in an approximate Bayesian framework. The primary input for the analysis is the variance in allele frequency from time-sampled data. We have previously described four patients from whom samples were collected at 3 or more time points (14), and WFABC analysis was applied to segregating alleles from these samples. In total, 3,220 alleles were analyzed. The median estimated selection coefficient was −0.00422 (95% confidence interval, −0.00488 to −0.00355), and the distribution of fitness effects for segregating alleles is centered around neutrality (s = 0).
Taken together, these results consistently suggest that most segregating HCMV alleles are best characterized as nearly neutral and are thus predominantly governed by genetic drift.
Simulating interhost/intrahost dynamics.With these estimates of the distribution of fitness effects, one may quantify expectations regarding the relationship between intrahost and interhost diversities. Simulations were performed by modeling HCMV intrahost populations, which were then assessed for expected patterns of interhost diversity and finally compared with empirical observations. The simulations incorporated population genetic parameters inferred in a previous study of 48 HCMV intrahost populations (11). The parameters included the mean effective population size (Ne), as well as mutation and recombination rates. Further, four additional sets of simulations were performed, following previous reports suggesting that background selection and demography alter the patterns of polymorphism of in vivo HCMV populations (10). These simulations are identified as strong constraint, demography, demography/strong constraint, and demography/strong and weak constraint. For the strong-constraint simulations, a fraction (30%) of the genome was randomly identified as selectively constrained (following the inference of Renzette et al. [14]), so that all new mutations in these regions had a fitness of 0 (i.e., lethal) and did not produce progeny in subsequent generations. This model is thus consistent with Kimura's neutral theory (3, 18), in which alleles can either be wholly lethal or wholly neutral. For the nonequilibrium demography simulations, populations experienced a bottleneck 500 generations in the past that reduced the population size to 1% of the original size (i.e., at the time of infection), followed by a period of exponential growth, as estimated by Renzette et al. (11). The demography/strong-constraint simulations combined parameter values from the strong-constraint and demography simulations. Finally, demography/strong- and weak-constraint simulations incorporated all aspects of the demography/strong-constraint model but also allowed for weak fitness effects, the importance of which is strongly suggested by the above-inferred DFE. Specifically, fitness effects for mutations outside constrained regions (where all new mutations are lethal) were randomly drawn from a normal distribution (Fig. 2B). These simulations were thus consistent with Ohta's nearly neutral theory (4). For all simulations, each intrahost population was simulated 100 times, for a total of 4,800 simulated populations.
Comparing intrahost population dynamics: the site frequency spectrum.The first analysis of the simulated data was an estimation of the SFS under the different scenarios, as the SFS is expected to greatly affect the quantitative relationship between intrahost and interhost diversity (Fig. 1). Here, the SFS was estimated using a standard approach for the analysis of viral data. In many cases, a reference sequence for the viral species is used to identify polymorphisms of a sample, though it should be noted that the reference sequence traditionally represents the first or most accurate genome length data for a species. Thus, the method entails randomly selecting a consensus sequence from the total pool of a viral species and comparing intrahost populations to this sequence. We used an identical strategy with the simulated data set, in which a randomly chosen consensus sequence was denoted the reference sequence to which intrahost populations were compared. However, this analysis leads to a folded frequency spectrum because of the potential for misinference of the ancestral allele.
Figure 3 demonstrates that the folded SFS of the empirical data set differed from those of the simulated data sets in a few general areas. In the neutral and strong-constraint simulations, there was a dearth of low-frequency alleles (freq < 0.05) and an excess of midfrequency alleles compared to the empirical data set. It should be noted that the SFS of the strong-constraint simulations was not skewed compared to neutral simulations, consistent with the expectations of strong purifying selection (9). The inclusion of realistic demographic histories in the simulations clearly improved the fit of the simulations to the empirical data (Fig. 3). Further, the demography/strong- and weak-constraint simulations most closely matched the SFS of the empirical data set. This result likely reflects weak purifying selection targeting a subset of mutations, allowing background selection to skew the SFS (19).
Folded site frequency spectra of empirical and simulated data sets. As shown, neither equilibrium model well reproduces observed patterns, while the inclusion of nonequilibrium population estimates (i.e., demography) results in a much improved fit. Within that class, the inclusion of both strong and weak purifying selection (i.e., constraint) reproduces empirical observations particularly well.
Comparing interhost population dynamics: consensus divergence.Using these simulations, one can examine the expected interhost diversity of the simulated intrahost populations. For this purpose, we compared the consensus sequences from the populations and calculated the pairwise distances from the comparisons, a common analysis employed in interhost analyses, as described above. From the set of 100 simulations for each population, we randomly chose 10 consensus sequences to compare to 10 randomly chosen consensus sequences from every other simulated population. The results from the simulated data set were then evaluated against results from empirical data described previously (i.e., 41 consensus sequences, for a total of 820 pairwise comparisons). The influences of strong constraint and demography on the simulated pairwise distances are clear (Fig. 4). Imposing strong selective constraint on a fraction of the genome does not skew the SFS (Fig. 3) but reduces the total number of intrahost polymorphisms, consistent with a rescaling of Ne (9, 19), thereby reducing the interhost pairwise distances (Fig. 4). In contrast, the pairwise distance is increased under demography alone, where a larger number of population-specific fixations occurred, likely due to the effects of genetic drift during the bottleneck phase. The combined effects of demography and strong constraint result in a midlevel of pairwise distance, which is slightly elevated compared to the empirical results. Finally, including weak fitness effects reduces the median pairwise distance of consensus sequences and improves the fit of the simulated data to the empirical results (with the median pairwise distance obtained from this class of simulations equal to 0.01893 compared to a median empirical value of 0.01981). Furthermore, the distributions of values were similar in the two data sets.
Pairwise distances of consensus sequences. Utilizing the same six categories and color schemes as for Fig. 3, the mean and variance of interhost pairwise differences of both the observed data and models are plotted. As shown, the observed number of pairwise differences is a predicted outcome of the best-fitting intrahost model (i.e., demography with strong and weak purifying selection). Boxes represent the 95% confidence interval, and whiskers the 99% confidence interval.
Finally, the number of expected interhost segregating sites genome-wide was compared with empirical values, reported to be between 28,101 (10) and 31,528 (1). Similar to the pairwise distance results, selective constraint reduced the total number of interhost segregating sites and the variance of the statistic, while nonequilibrium demography increased this expectation. Further, the median values of expected segregating sites for both the demography/strong-constraint and demography/strong- and weak-constraint simulations were within the range reported from empirical studies (Fig. 5).
Interhost segregating sites from simulated data sets. Shown are box plots of the interhost segregating sites observed in various simulation scenarios. The red lines represent the empirical levels of interhost segregating sites reported previously (1, 10). As shown, the observed number of interhost segregating sites is a predicted outcome of the best-fitting intrahost model (i.e., demography with strong and weak purifying selection). Boxes represent the 95% confidence interval, and whiskers the 99% confidence interval.
In sum, the diversity of the interhost data set is accurately reconstituted using the inferred intrahost population genetic models.
Conclusions.Here, we examined the relationships between intrahost and interhost populations and analyses. HCMV data were utilized to illustrate this principle, as much previous work has elucidated critical population genetic parameters in the virus, including mutation and recombination rates, and intrahost population demography. However, these results may be evaluated in the larger field of viral evolution. Traditionally, viral diversity and evolution were studied at the interhost level, largely because these data were readily accessible. For example, Sanger sequencing of clinical isolates provided data on the consensus sequence of a virus sampled from many hosts, though any intrahost diversity was largely unexplored with these techniques. Advances in both sequencing technology and computational and statistical analyses have shed new light on the intrahost population dynamics of viruses with unprecedented levels of detail and accuracy. This has also created an unnecessary dichotomy in the field, where intra- and interhost studies are often envisioned as separate analyses focusing on somehow uncoupled aspects of viral evolution. However, as this study demonstrates, interhost diversity is a clear result of the inferred intrahost population genetic dynamics, in which the best-fitting model incorporates inferences from previous studies, including information on selectively constrained regions of the genome, nonequilibrium demographic histories, and the effects of background selection. Viewing interhost/intrahost data in this framework allows previously paradoxical results to be seen as highly predictable.
MATERIALS AND METHODS
Data.The HCMV intrahost empirical data set and population summary statistics were described previously (14) and consist of next-generation sequencing data collected from 48 HCMV-positive clinical samples. Likewise, the interhost data set and summary statistics were described previously (10) and consist of all the available whole genomes of minimally passaged HCMV isolates from GenBank (n = 41) available at the time of the original study. As described in these publications, polymorphisms were discarded if the base call quality was <20 or if the local sequence depth was <15. They were then filtered based on platform- and base-specific error rates calculated by sequencing an error control. Finally, P values were calculated for all polymorphisms, assuming a Poisson distribution of error frequencies, and variants with P values of >0.01 were excluded. Summary statistics from a second, previously described interhost data set consisting of 96 genomic sequences (1) were also included.
Time-sampled inference of the DFE.We utilized an approximate Bayesian framework to infer the viral effective population size and the DFE, using the program WFABC (17, 20, 21). In brief, allele frequencies were inferred from multiple time point samples, and the variance in allele frequencies was used to estimate viral population size and per-site selection coefficients in a two-step process, assuming independence between sites. Sequence data from the first time point sample were used to infer the ancestral sequence, thus allowing frequencies of derived alleles to be inferred over time. Candidate allele trajectories in which the derived allele reached a frequency ≥0.05 were included in the analysis, and tri- and quadallelic loci were excluded. The software can be found online (http://jjensenlab.org/software ).
Single-time-point inference of the DFE.A second, independent method was used to infer the distribution of fitness effects of nonsynonymous alleles, again assuming independence between sites. Specifically, the approach of Eyre-Walker and Keightley (13), extending the method of Eyre-Walker et al. (22), was used to analyze publicly available data sets of HCMV intrahost polymorphisms. Alleles were classified as synonymous or nonsynonymous using the HCMV reference sequence (NC_006273.2 ). The software can be found at http://www.lifesci.susx.ac.uk/home/Adam_Eyre-Walker/Website/Software.html .
Simulation.Computer simulations of HCMV intrahost populations were performed using the SFS_code forward simulator (23) under a range of population models described below. For each population model, a simulation of 48 isolated populations was performed for a region of DNA equal to 11,782 bp, which is approximately 1/20 the size of the HCMV genome (235,646 bp). The genome size was scaled to increase throughput of the simulations, as forward simulations are computationally intensive and a large number of simulated populations were analyzed in this study. The effective population size, as well as recombination and mutation rates for each simulated population, were set equal to those values previously estimated for HCMV intrahost populations (11, 14). Summary statistics from each of the simulated populations were then combined and analyzed analogously to the empirical intrahost data sets. Further, polymorphism data from the simulated populations were used to evaluate the levels of interhost (i.e., interpopulation) divergence through pairwise comparisons of all simulated populations. To simulate strong constraint, the “-c” command line option was used to set the fraction of new mutations that are lethal (i.e., fitness = 0) equal to 0.30. Simulations including demography were performed, in which the population experiences bottlenecks 500 generations in the past, reducing population sizes to 1% of previous levels (with the “-Td” command), followed by exponential growth (with the “-Tg” command) to current levels (following the demographic model inferred in reference 11). Weak constraint was included in simulations with the “-selDistType” option set to “type 3” (i.e., normal distribution), in which the mean and variance of the distribution from which selective effects were drawn in the simulations were set to the mean and variance of the distribution of fitness effects inferred from empirical data with the WFABC approach (Fig. 2). For all simulations, sample sizes were set to 100 and ploidy was set to 1. The software can be found at http://hernandezlab.ucsf.edu/software/ .
ACKNOWLEDGMENTS
We thank Kristen Irwin for careful readings of the manuscript.
This work was funded by grants from the European Research Council (ERC) and the Swiss National Science Foundation (FNS) to J.D.J. T.F.K. received funding from the National Institutes of Health (R01HD061959 and R01AI109001).
The content is solely our responsibility and does not necessarily represent the official views of the National Institutes of Health.
FOOTNOTES
- Received 12 October 2016.
- Accepted 1 December 2016.
- Accepted manuscript posted online 14 December 2016.
- Copyright © 2017 American Society for Microbiology.