Previous Article | Next Article ![]()
Journal of Virology, October 2004, p. 11296-11302, Vol. 78, No. 20
0022-538X/04/$08.00+0 DOI: 10.1128/JVI.78.20.11296-11302.2004
Copyright © 2004, American Society for Microbiology. All Rights Reserved.
Department of Biostatistics, UCLA School of Public Health,1 Department of Biomathematics, David Geffen School of Medicine at UCLA, Los Angeles, California,4 Wadsworth Center, New York State Department of Health, Albany,2 Montefiore Medical Center, Bronx, New York3
Received 25 February 2004/ Accepted 8 June 2004
|
|
|---|
|
|
|---|
Phylogenetic analyses have traditionally been used to reconstruct the evolutionary histories of contemporaneous sequences. To address questions of intrahost evolution, however, it becomes necessary to examine the evolutionary relatedness of sequences sampled at different times. In general, such analyses do not allow for the estimation of intrahost evolution across multiple patients simultaneously. Many studies either calculate rates of nucleotide substitution and infer trees for each set of sequences independently or, alternatively, construct a single multipatient tree governed by the same substitution process for all individuals.
In one such analysis of viral evolution following primary infection, changes in the V3 region of HIV-1 env were examined by independently constructing phylogenetic trees for each of eight patients (16). A similar study examined evolutionary changes in the C2-V5 portion of env by constructing a single multipatient phylogenetic tree (27). In these two studies, the impact of antiretroviral therapy on HIV-1 evolution was not addressed. Frost et al. (7) examined the divergence and diversity of HIV-1 envelope at two time points: before the initiation of highly active antiretroviral therapy (HAART) and 2 years into HAART. They found strong evidence of divergence and positive selection in two of six patients. Viral evolution under HAART has also been studied to examine viral reservoirs, latency, and drug resistance (8, 24, 38, 39). In this study, we examine envelope evolution under antiretroviral therapy across multiple patients simultaneously.
To investigate the forces shaping HIV-1 evolution, we extended the Bayesian hierarchical phylogenetic model of Suchard et al. (29) by incorporating gamma-distributed rate variation across sites. Within the analytical framework outlined in the Bayes theorem, all information or knowledge about a data set is taken into account; the initial probability assessment of the evolutionary parameters (the prior distribution) is used to update the probability estimates obtained after the data are analyzed (the posterior distribution). Further, Bayesian techniques are adept at handling averaging across discrete quantities that have uncertainty, such as multipatient phylogenetic analyses, where evolutionary parameters, like trees, may vary from patient to patient (20).
We used our Bayesian model to examine a recently recognized phenomenon with direct relevance to antiretroviral treatment: changes in HIV-1 coreceptor usage during antiretroviral therapy. HIV-1 strains transmitted in vivo generally use the coreceptor CCR5 (R5 viruses), but strains using CXCR4 (X4 viruses) later emerge in approximately 50% of infected individuals (2, 18, 26, 27). The appearance of X4 strains predicts an acceleration of HIV-1 disease progression (2, 18, 26, 27). The natural course of disease progression, however, can be altered by antiretroviral therapy (1, 23). It has been shown that potent antiretroviral therapy can preferentially suppress X4 strains of HIV-1 in patients, shifting the viral population back to R5 after treatment (25). It is not known whether the R5 isolates that predominated after therapy evolved from a predecessor R5 strain or from circulating X4 virus.
In the present study, we investigated HIV-1 sequence evolution, examining changes in coreceptor usage in patients who received antiretroviral therapy but did not achieve complete virologic suppression. We tested for common patterns in intrahost viral evolution by using a Bayesian hierarchical model that allows for interpatient variation while estimating phylogenetic tendencies across patients simultaneously. Using this model, we found that the R5 virus present after therapy was most closely related to predecessor R5 virus. It is unlikely that this virus arose through continued evolution of X4 strains. These analyses illustrate the applicability of our hierarchical Bayesian method for evolutionary studies of HIV-1 and other viruses.
|
|
|---|
Virologic and immunologic characteristics of the patients are listed in Table 1. Patients 17 and 28 were described previously (17). Patients 9 and 17 were also described in a separate study, where they were referred to as subjects 5 and 8, respectively (25). Human genotypes for the CCR5 coreceptor were previously determined; all patients were homozygous for the wild-type coreceptor (5).
|
View this table: [in a new window] |
TABLE 1. Clinical, virologic, and immunologic characteristics of patients
|
represents the proportion of viruses using CCR5:
= 1 represents an isolate in which all strains prefer CCR5 whereas
= 0 indicates that all isolates prefer CXCR4. We selected three time points as follows: at the first time point the patient quasispecies was dominated by R5 strains of HIV-1, defined as
0.5. At the second time point, the patient exhibited a higher proportion of X4 strains, as suggested by a decrease in
from the previous time point. The last time point selected was either after initiation of antiretroviral therapy or following a change in treatment, when each patient exhibited a high proportion of R5 strains. No patient achieved complete virologic suppression, regardless of antiretroviral regimen. Determination of coreceptor usage of biological clones. Primary isolates of HIV-1 were obtained by coculture of the patients' peripheral blood mononuclear cells with healthy donor peripheral blood mononuclear cells; biological clones were then derived from primary isolates by short-term limiting dilution cloning (25). On average, there were 25 (range, 22 to 27) clones per patient time point. The coreceptor phenotype of each clone was determined by the HOS-CD4 assay (25). Determination of coreceptor utilization was based upon a functional, biologic assay rather than genotypic analysis of viral env sequences, because such a sequence-based approach might lead to inadvertent phenotype attraction during phylogenetic reconstruction. Phenotypic attraction may occur when the genotype of the organism that is under analysis codes for the same phenotype that is the basis for inclusion in the study. That is, if we use the sequence information to determine the phenotype and then use the same genotypic information to determine the phylogenetic relationship, we can get more clustering among phenotypes than would be otherwise expected by chance, thus biasing the true phylogenetic relationships among the sequences.
PCR amplification, cloning, and sequencing. Reverse transcription-PCR was used to clone the V1-C5 portion of the HIV-1 env gene from each biological clone with use of primers HX6556F (5'-ATGGGATCAAAGCCTAAAGCCATGTG-3') and HX7792R (5'-AGTGCTTCCTGCTGCTCCCAAGAACCCAAG-3'). Measures to ensure representative HIV-1 env clones were performed (14). High-fidelity rTth DNA polymerase was used to limit sequence variation due to misincorporation errors. Limiting dilution reaction conditions designed to minimize the likelihood of recombination between molecules during PCR amplification were used (6). Cloned env fragments were sequenced using an automated DNA sequencer (Applied Biosystems).
Phylogenetic analysis.
To account for the interhost variability and uncertainty in describing viral evolution among multiple hosts, we used a Bayesian hierarchical phylogenetic model (29). The general idea behind Bayesian analysis is to combine what is known about the data and parameters a priori (called the prior) to obtain updated (posterior) probability assessments of the hypotheses. In this way, what is known scientifically or what is believed by experts can be incorporated into the overall analysis. The main objective of the Bayesian approach is to estimate the posterior distribution of the unknown model parameters given the observed phylogenetic data, where the posterior distribution also includes prior knowledge about the parameters and the data. This is accomplished using Bayes theorem. Given a statistical model with parameters
and data Y, Bayes theorem states that the posterior distribution of
is proportional to the sampling density of the data given
(commonly referred to as the model likelihood) multiplied by the prior distribution of
:
![]() |
incorporates any knowledge that we may have about
before observing the data and the denominator p(Y) is a normalizing factor that ensures that the posterior density integrates to 1. Bayes theorem thus specifies how beliefs about
change from before to after an experiment is performed. Various competing models or hypotheses may be described by restricting one or more parameters in
to specific ranges of values and then compared using Bayesian model selection (10, 19, 20, 31, 35). Bayesian approaches are advantageous in phylogenetics because the primary analysis provides an estimate of the tree and measurements of uncertainty often faster than estimates that can be obtained by using maximum likelihood bootstrapping techniques (14, 15, 31). Bayesian analysis can also provide estimates of complex models where maximum likelihood methods would be intractable or too computationally burdensome. Finally, Bayesian model selection offers advantages over likelihood methods in that the competing evolutionary hypotheses need not be nested and the Bayesian approach does not rely on standard likelihood asymptotics (32, 35).
The Bayesian hierarchical phylogenetic model allows information from all patients to be used simultaneously to improve estimate precision in individual patients. The general idea behind hierarchical modeling is to think of the smallest (lowest-level) unit as being organized into a hierarchy of successively higher units. In this case the within-patient distributions are the smallest unit and the between-patient distributions are the top level. Effects at the lower level can be thought of as one of an exchangeable collection of effects that is drawn from a common distribution at the higher levels. When data are sparse, pooling information via a hierarchical model smoothes the parameter estimates of the evolutionary model, providing more precise estimates than analyses that treat each patient as an isolated entity. The hierarchical framework also permits the estimation and testing of patterns across patients while allowing the uncertainty in each individual's evolutionary parameters to vary. This is accomplished by constructing hierarchical priors on the parameters within a patient. The model thus combines the results from the individual patients to provide population-level summaries. The population-level model and the patient-level models are fit simultaneously, enabling the population-level model to feed back information, in the form of a prior, to help in the estimation of the individual patient parameters. In regression terms, the population-level parameters can be thought of as fixed effects and the patient-level parameters can be thought of as random effects that describe how individual patient parameters may vary from the population-parameter means.
To reconstruct an evolutionary tree at the patient level, we employed a continuous-time Markov chain model for nucleotide substitution that assumes that the substitution mechanism is independent across branches in the patient-specific tree and is a memory-less process. The probability of nucleotide X mutating to Z, where X, Z = {A, C, T, or G}, along a branch with length t equals exp (t
)X,Z, where
is a 4 x 4 infinitesimal rate matrix similar to that of Tamura and Nei (33). Matrix
is constrained so that the branch lengths are expressed in terms of the expected number of nucleotide substitutions per site (37). Since branch lengths may not retain definition between trees, we modeled branch lengths tis within patient i as Exponential(µi), where s = 1,...,5. The prior expected divergence, µi, is interpretable across trees, and thus we can share branch length information across patients. Site-to-site rate variation follows a discrete gamma distribution with 4 bins and shape parameter
i (36). Let
i = (
i, tis,
i,
1i,
2i, µi,
i) for patient i, where
i is the unknown tree relating all of the sequences from patient i,
1i is the transition/transversion rate ratio for transitions between the purines,
2i is the transition/transversion rate ratio for transitions between the pyrimidines, and
i are the stationary nucleotide frequencies. We extend over i = 1,...,N multiple patients simultaneously by assuming the following hierarchical priors:
![]() |
![]() |
![]() |
= diag(
2
,
2k1,
2k2,
2µ). The priors state that the log of the rate variation parameter
i, the average divergence µi, and the transition/transversion rate parameters
1i and
2i are multivariate-normally distributed across patients with estimable population-level mean vector M and variance-covariance matrix
. The nucleotide frequencies,
i, came from a common distribution with an estimable mean
and precision N
, where a Dirichlet is the natural parameterization for proportions. The trees,
j, drawn with estimable population-level probabilities T, where T = (T1,...,TC), are the population-level probabilities for each of the C = 3 (in this case) possible trees. For hyperpriors on the population-level parameters, we took T
Dirichlet (NQ x Q) where Q is the prior probabilities of the C different trees and NQ is the weight given to the prior probabilities Q; here we take NQ = 1. We took vague priors over (M,
, N
,
). For example, M was a diffuse normal and
was a flat Dirichlet. We incorporate half-Cauchy priors on
2
,
2k1,
2k2, and
2µ, as the half-Cauchy has been recently shown to have more desirable properties than the inverse gamma as an uninformative prior when the number of groups is small (9). Each model was run for 300,000 iterations. After discarding the first 50,000 iterations as burn-in, we subsampled every 25th iteration to yield 10,000 posterior samples with decreased autocorrelation. The largest model ran in 3 h and 48 min on a 1-GHz Pentium III PC. The burn-in times and total Markov chain Monte Carlo (MCMC) chain lengths were significantly longer than what appeared to be required from examining model log-likelihood time-series plots (3). The Bayesian hierarchical phylogenetic model is available for download at http://www.biomath.ucla.edu/msuchard.
Nucleotide sequence accession number. The sequences of the cloned env fragments have been deposited in GenBank under accession numbers AY322081 through AY322106.
|
|
|---|
R5 strains that persist after antiretroviral therapy (R5-r) may have evolved either from earlier R5 strains (R5-p) or from circulating X4 virus. Three possible phylogenetic trees can be constructed from a data set containing four sequences: R5-p, R5-r, X4, and an outgroup sequence (Fig. 1). When R5-r is most closely related to X4 in a phylogenetic tree (Fig. 1, tree X), the hypothesis that R5-r evolved from X4 virus is favored. If R5-r and R5-p are neighbors (Fig. 1, tree R), there is evidence supporting evolution of reemergent R5 virus from earlier R5 strains. The final tree has R5-r and an outgroup as neighbors (Fig. 1, tree O) and is unlikely a priori. The Bayesian hierarchical model was fitted such that the prior probabilities of trees R and X were equal and were 10 times more likely than tree O. This is equivalent to setting the prior for each tree to Q = (QR, QX, QO) = (0.45, 0.45, 0.10).
![]() View larger version (8K): [in a new window] |
FIG. 1. Possible phylogenetic topologies relating three longitudinal sequences. The first sequence is taken at the initial time point, when R5 strains dominate the viral quasispecies (R5-p); the second sequence is taken at an intermediate time point when X4 strains are circulating (X4); and the third sequence is collected at the final time point, following initiation of potent new therapy and a concomitant suppression of X4 strains, resulting in a shift back to predominantly R5 strains (R5-r). The fourth sequence is an outgroup (O).
|
1i and
2i, expected divergences µi, gamma shape parameter
i, and stationary distributions
i are also given in Table 2. The table illustrates the variability between corresponding parameters. Expected divergences demonstrated considerable variability between patients, with patient 9 demonstrating the highest rate of evolution and patient 17 demonstrating the lowest. The posterior probabilities provided estimates of the likelihood that each tree is correct, given the data and the outgroup chosen.
![]() View larger version (11K): [in a new window] |
FIG. 2. Most probable phylogenetic tree for each patient, with use of HXB2 as the outgroup. Branch lengths are drawn to scale across patients and represent posterior mean estimates.
|
|
View this table: [in a new window] |
TABLE 2. Hierarchical individual-level estimates for each patient, with HXB2 used as the outgroupa
|
![]() |
0.03 (12). |
View this table: [in a new window] |
TABLE 3. Hierarchical population-level estimate for the probability of each tree for all patients given the outgroupa
|
0.95 in all patients. Table 4 lists the estimated probability for each model tree. The Bayes factor was 20.69 in favor of continued evolution of reemergent R5 virus from earlier R5 strains, P
0.03. Similar results were obtained when JR-CSF, an R5 virus, was used as the outgroup (data not shown). |
View this table: [in a new window] |
TABLE 4. Hierarchical individual-level estimates for each patient, with HXB2 without V3 used as the outgroupa
|
|
View this table: [in a new window] |
TABLE 5. Analysis of the data with each patient treated independently and with sequence data for each time point concatenated so that a single evolutionary tree is createda
|
We compared model fit using the deviance information criterion (DIC), which is analogous to the Akaike information criterion (28). The DIC utilizes a measure of model fit, D, and a measure of model complexity, pD. Minimizing D + pD yields the most parsimonious model with the best fit; thus, the model with the smallest value for DIC is the model with the best fit. D equals 2 times the mean posterior log likelihood, and pD equals D
, where
is 2 times the log likelihood at the posterior mean. For the trees, we took the mean branch lengths of the mode topology. We used this criterion to compare the hierarchical model to the concatenated model. For the hierarchical model, D was equal to 16,925 and pD was equal to 26, yielding 16,951 for the DIC estimate. The value for D for the concatenated model was equal to 16,970, and the value for pD was 10; thus, the DIC estimate was 16,980 for the concatenated model. From the DIC estimates, we found that the hierarchical model had a better fit than the concatenated model.
To assess the performance of our sampler's mixing in estimating the posterior probability of a particular tree, we used scaled regeneration quantile (SRQ) plots (21, 30). Regeneration times are defined as the time that it takes the Markov chain to return to a specific topology. Tours of the chain between these times do not have to be completely independent and identically distributed to assess the performance of our sampler (21). The slope between two scaled regeneration times in an SRQ plot is the ratio of the posterior probability estimate based on the entire chain to the estimate based on the chain between these tours. Substantial deviations from a slope of 1 indicate that the chain is not long enough. Figure 3 shows a representative SRQ plot from the MCMC chains. We superimposed a dashed line with slope equal to 1 and found that there is no large deviation. This suggests that our chains were mixing well and were sufficiently long enough to produce stable posterior estimates. We make our complete datasets available to interested readers at http://www.bol.ucla.edu/~cr/Datasets.htm.
![]() View larger version (7K): [in a new window] |
FIG. 3. SRQ plot to assess model performance in estimating the posterior probability of tree X under the hierarchical model. The absence of a substantial deviation from a slope of 1 (dashed line) implies that the chain is long enough to produce stable estimates.
|
|
|
|---|
We focused on patients whose viral populations in the blood displayed a shift in HIV-1 coreceptor utilization in response to therapy. In each case, although X4 strains emerged, the viral population later reverted to R5 following initiation of new therapeutic regimens. By phylogenetically reconstructing the evolutionary relationship of HIV-1 obtained longitudinally from each patient, it was possible to examine the origin of the reemergent R5 virus.
By using our hierarchical approach, we found evidence that the R5 virus reemerging after antiretroviral therapy evolved from predecessor R5 strains rather than from X4 virus. All analyses supported this model, regardless of the outgroup chosen. Our method also allowed us to detect evolutionary patterns that could have been missed by more traditional phylogenetic techniques. Analyzing each set of patient sequences independently allowed each individual to have his or her own evolutionary parameters but did not take into account what was known about the other patients. Because of not allowing pooling of information across patients, estimates of each patient's evolutionary parameters had a wider confidence interval than did estimates obtained from the hierarchical model. Furthermore, while the independent method allows probability calculations for each individual phylogenetic tree, population-level estimates of support for competing evolutionary models cannot be determined. Analysis of concatenated sequences, on the other hand, allows patients to be examined simultaneously but also forces all patients to have the same evolutionary parameters, an assumption that may not be justified.
The Bayesian hierarchical model represents an alternative to analyzing sequences obtained from each patient separately or concatenating all of the sequences together. The superiority of this model is illustrated in its ability to simultaneously estimate patient parameters while allowing for individual variation in the evolutionary parameters. The sharing of information across patients allows for narrower errors in parameter estimates compared with other methods. However, this comes at the cost of bias due to shrinkage effects. Finally, this method can calculate the overall probability of competing hypotheses across patients. This aspect of the model can become especially important in studies where there may be considerable variation in the support for each tree across patients.
Our analyses were based upon RNA sequences of biologically cloned HIV-1. Although culture of HIV-1 can result in selection of minor variants, it is unlikely that the brief series of in vitro passages performed here significantly affected the coreceptor utilization of these biological clones. Even following repeated passage of HXB2 and JR-CSF isolates in our laboratory; for example, no change has ever been observed in the coreceptor utilization of these viral stocks (data not shown). It is important that, although our phylogenetic reconstructions were performed using env sequences, the coreceptor phenotype for each viral isolate was determined independently. Sequence selection was initially based on biological phenotype, eliminating any bias that may have arisen from genotypically predicting coreceptor utilization and then using the same sequences to estimate evolutionary relationships. To ensure that our results were not based on a sampling artifact, we sequenced another random sample of clones at each time point and reran the analysis. Support for tree R was again supported with similar probability (data not shown).
Our results support the hypothesis of reemergence of predecessor strains following a major change in therapy. These results, however, were obtained from study of a limited number of patients. None of the individuals studied here achieved complete viral suppression even while receiving HAART (4, 13, 25). It is possible that the evolutionary pattern of the reemergent virus may be different in patients who have sustained viral suppression. Understanding viral evolution under the selective pressure of antiretroviral therapy is an important step in understanding viral dynamics, drug resistance, and the durability of clinical response to treatment. Additional studies of viral evolution in well-characterized patients are needed.
Finally, this paper illustrates the usefulness of our Bayesian hierarchical model. This model is not limited to phylogenetic analysis of HIV evolution. Rather, this model is generalizable to situations where it is necessary or desirable to pool sequence information across multiple patients to improve estimate precision of individual patient parameters while simultaneously estimating shared patterns across patients.
|
|
|---|
This article has been cited by other articles:
| |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Copyright © 2009 by the American Society for Microbiology. For an alternate route to Journals.ASM.org, visit: http://intl-journals.asm.org | More Info»