Journal of Virology, February 2006, p. 1637-1644, Vol. 80, No. 4
0022-538X/06/$08.00+0 doi:10.1128/JVI.80.4.1637-1644.2006
Copyright © 2006, American Society for Microbiology. All Rights Reserved.
Departments of Microbiology(1) and Medicine(2), University of Washington School of Medicine, Seattle, Washington 98195-8070
Received 20 August 2005/ Accepted 21 November 2005
It is also of critical public health importance to understand the size, growth rate, and distribution patterns of the HIV pandemic. In this context, interhost sequence data sets are analyzed with population genetic and genealogic methods to infer the past and present population dynamics of HIV (11, 33, 46), to measure HIV diversity in emerging local epidemics (1, 46), and to date the introduction of HIV-1 into the human population (18, 34). The main group of HIV-1, group M, the cause of most cases of AIDS worldwide, can be classified into nine subtypes and at least 16 circulating recombinant forms (22). Most epidemiological studies have focused on specific countries and the particular subtypes occurring in those countries, revealing large differences among countries and geographic regions (1, 34).
Biological processes that can influence patterns of genetic diversity and evolution in a population include mutation, recombination, migration, natural selection, and genetic drift. For HIV, as a pathogen with interhost transmission, an additional process involves the geography and dynamics of human interaction. All of these processes may affect the patterns of variation observed in HIV interhost sequence data sets. However, HIV accumulates genetic diversity only within individual hosts. The diversity of the intrahost viral population is generally low after transmission and increases during the course of infection at a rate of approximately 1% per year (in the C2 to V5 region of the envelope glycoprotein gene [36]), with sequence pairs reaching at least 15% difference in long-term survivors.
A main assumption in evolutionary analyses of HIV interhost sequence data sets is that genetic divergence accumulated within hosts is maintained through transmission events. If we accept this assumption, it follows that HIV interhost divergence from the estimated most recent common ancestor (MRCA) increases through time, with no punctuation at interhost transmission, and regardless of the time each viral lineage spends in an infected individual. In other words, all viral lineages in all infected individuals would be expected to be equally divergent from the MRCA at any given moment in time, regardless of the duration of infection within a given individual or how many transmission events have occurred in a given viral lineage. One then predicts that any two distinct, contemporaneously sampled interhost data sets from comparable cohorts (including treatment regimens, if applicable) within a local epidemic will contain equivalent levels of genetic diversity and divergence.
There is sufficient biological evidence to question the above assumption. For example, some cytotoxic T-lymphocyte (CTL) escape epitopes are known to revert after transmission to a new host (see, for example, reference 23), CXCR4 receptor-utilizing (X4) viruses are notably absent during early infection (see, for example, references 36 and 42), and intrahost viral populations undergo homogenization in acute infection (21), irrespective of the route of transmission (37). While the degree of genetic heterogeneity observed in primary infection differs between studies (7, 25, 44, 47), it has long been accepted that HIV populations undergo one or more substantial bottlenecks during early infection (45). For example, a recent study that documented viral genetic variability in eight heterosexual transmission pairs clearly demonstrated that an extreme bottleneck can accompany transmission from donor to recipient (8). Despite these observations, the stage of illness has not been explicitly considered as an important variable in analyses of interhost sequence data sets.
In this study, we examined the possible impact of HIV intrahost evolution on estimates of HIV interhost genetic diversity and divergence. First, we used data sets from three distinct subject cohorts representing different times since infection but sampled within the same calendar year (Multicenter AIDS Cohort Study [MACS] and University of Washington Primary Infection Cohort [PIC]) or roughly the same calendar period (Lyon cohort). Second, we used known transmission pairs to compare viral divergence from the interhost MRCA in the donor and the recipient. Third, we examined intrahost viral evolution in a longitudinal cohort.
TABLE 1. Estimates of mean pairwise divergence (corrected and uncorrected π) and θ for HIV-1 subtype B interhost data sets
Although the PIC sequences are not epidemiologically linked and are, therefore, a valid comparison to MACS 1993 sequences for HIV-1 subtype B diversity and divergence estimates, we included another interhost set of primary infection sequences for corroboration. Ataman-Onal et al. (5) analyzed a cohort of 11 subjects in Lyon, France, with symptomatic primary infection of HIV-1 subtype B. We used previously published env sequences from eight of these subjects sampled in 1993 (n = 5), 1994 (n = 2), and 1995 (n = 1) and aligned these sequences to our PIC and MACS env data set. We found no epidemiologic linkage in this cohort using the tests described above. Using the Lyon primary infection sequences, we created a single interhost data set for comparison to our 10 PIC 1993 interhost data sets.
Diversity and divergence measurement. Population dynamics can be described using summary statistics such as the mean pairwise diversity, π, and the neutral parameter θ (defined as 2Neµ in a haploid population). In θ, the effective population size (Ne) and the substitution rate (µ) are conflated, yet estimating either parameter (or estimating θ values where µ is not thought to vary) can provide useful epidemiological information. For each replicate interhost data set (all nucleotides, synonymous and nonsynonymous) we used SITES (43) to estimate π (uncorrected) and θ, calculated using both π (θπ) and the number of segregating sites (θW). We then used Modeltest with the Akaike information criterion to identify appropriate substitution models for each data set. These substitution models were used to estimate maximum likelihood-corrected mean pairwise divergences in PAUP* version 4.0b10 (41); we also estimated HKY85 corrected distances. Because of the relatively small data sets (nine sequences in each replicate) and the potential for diversity estimates to be skewed by outlying subjects, we performed a jackknife procedure on all data sets by removing single sequences from each replicate and reestimating diversity values. We found no significant differences in π or θ between complete and jackknifed data sets.
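The summary statistics in this section can be illustrated with a short sketch. The code below is not the SITES program; it is a minimal illustration, assuming an alignment of equal-length sequences, of the uncorrected mean pairwise diversity, of the per-locus segregating-sites estimator θW (the number of segregating sites divided by the harmonic sum over 1 to n − 1), and of a leave-one-out jackknife. All function names and the toy alignment are hypothetical.

```python
from itertools import combinations


def p_distance(a, b):
    """Proportion of differing sites between two aligned sequences.

    Sites where either sequence has a gap ('-') are skipped.
    """
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    return sum(x != y for x, y in pairs) / len(pairs)


def mean_pairwise_diversity(seqs):
    """Uncorrected mean pairwise diversity: average p-distance over all pairs."""
    dists = [p_distance(a, b) for a, b in combinations(seqs, 2)]
    return sum(dists) / len(dists)


def watterson_theta(seqs):
    """Per-locus theta from segregating sites: S / sum_{i=1}^{n-1} (1/i)."""
    n = len(seqs)
    segregating = sum(len(set(col) - {"-"}) > 1 for col in zip(*seqs))
    a_n = sum(1 / i for i in range(1, n))
    return segregating / a_n


def jackknife(seqs, statistic):
    """Recompute a statistic with each single sequence removed in turn."""
    return [statistic(seqs[:i] + seqs[i + 1:]) for i in range(len(seqs))]


# Toy alignment (made-up sequences, for illustration only).
alignment = ["ACGTACGT", "ACGTACGA", "ACGAACGT", "ACGTACGT"]
pi_hat = mean_pairwise_diversity(alignment)
theta_w = watterson_theta(alignment)
pi_jackknifed = jackknife(alignment, mean_pairwise_diversity)
```

Comparing `pi_jackknifed` with `pi_hat` shows how strongly a single outlying sequence can shift the estimate, which is the concern the jackknife is meant to address.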
We used PAUP* for maximum likelihood genealogy reconstruction for the MACS and PIC 1993 data sets, using the specific substitution models previously chosen in Modeltest. We produced two trees for each replicate, with and without a molecular clock enforced. From the genealogy without a clock enforced, we recorded the lengths of all external branches.
Genealogies and the coalescent approach. Coalescent methods focus on the rate at which lineages in a population coalesce backwards in time, making it possible to infer past population dynamics from a genealogy (17). This rate of lineage coalescence is related to evolutionary and demographic processes such as changes in effective population size, selection, migration, and recombination. These processes are reflected in the distribution of coalescent events in a genealogy and the distribution of substitutions along lineages (38). The coalescent is widely used to describe HIV population dynamics (11, 33, 46). Classic skyline plots are graphical summaries of coalescent analyses (33) that show the relationship between time, measured in substitutions per site per year, and the coalescent estimate of Neµ at time t. Estimates of Ne in coalescent analyses of HIV interhost data sets are intended to represent the effective number of infected individuals (11). For the MACS and PIC 1993 genealogies with the molecular clock enforced, we produced classic skyline plots with the Genie software package (32).
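The classic skyline estimate can be sketched in a few lines. The following is a minimal illustration, not the Genie implementation, assuming a clock-enforced genealogy of contemporaneously sampled sequences whose internal node heights are given in substitutions per site. It uses the standard classic skyline estimator, in which the length of the interval with k lineages is multiplied by k(k − 1)/2 to give a scaled population size (Neµ) for that interval. The function names and node heights are hypothetical.

```python
def coalescent_intervals(node_heights):
    """Turn internal node heights into (k, duration) intervals.

    node_heights: heights of the n - 1 coalescent events of an ultrametric
    tree with n contemporaneous tips, measured from the tips (height 0).
    Returns a list of (number of lineages k, interval duration), starting
    at the tips with k = n and ending at the root with k = 2.
    """
    heights = sorted(node_heights)
    n_tips = len(heights) + 1
    intervals = []
    previous = 0.0
    for i, height in enumerate(heights):
        intervals.append((n_tips - i, height - previous))
        previous = height
    return intervals


def classic_skyline(intervals):
    """Classic skyline estimate for each interval: duration * k * (k - 1) / 2."""
    return [(k, duration * k * (k - 1) / 2) for k, duration in intervals]


# Made-up node heights for a four-tip genealogy (substitutions per site).
heights = [0.01, 0.03, 0.06]
skyline = classic_skyline(coalescent_intervals(heights))
```

Because each interval contributes one estimate, the resulting step function is noisy for small genealogies such as the nine-sequence replicates used here.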
Transmission pairs. The study of donor-recipient transmission pairs can provide useful insights into the effects of transmission and primary infection on the evolution of HIV. To examine this, we utilized the HIV-1 V1 to V5 envelope region gene sequences from seven subtype C heterosexual pairs from Zambia (8). Donors had been infected for times ranging from 0.3 to 4.0 years before sampling. Recipients were sampled within 3 to 4 months of their last seronegative visit. This cohort has been described more fully elsewhere (8).
We created separate donor and recipient sequence data sets for each transmission pair, each with 12 subtype C reference sequences, two subtype A outgroups, and five sequences chosen randomly from all sequences of each subject. We reconstructed genealogies using PAUP* by both maximum likelihood and neighbor-joining methods with the HKY85 distance correction, employing Modeltest and the Akaike information criterion to choose appropriate likelihood substitution models for each individual data set. We estimated divergence in donor and recipient viral lineages by summing branch lengths from lineage tips to the node connecting the subtype C lineages to the subtype A outgroups, using the program TreeEdit, version 1.0a10.
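The divergence measure described above, summed branch lengths from each tip to a chosen ancestral node, can be sketched as follows. This is a minimal illustration and not the TreeEdit procedure; it assumes the genealogy is available as a mapping from each node to its parent and branch length. The tree, node names, and branch lengths are made up for illustration.

```python
def divergence_to_node(tree, tip, ancestor):
    """Sum branch lengths along the path from a tip up to an ancestral node.

    tree maps each non-root node name to (parent name, branch length).
    Raises KeyError if the ancestor is not on the path from tip to root.
    """
    total = 0.0
    node = tip
    while node != ancestor:
        parent, length = tree[node]
        total += length
        node = parent
    return total


def mean_divergence(tree, tips, ancestor):
    """Mean of the summed branch lengths for a set of tips."""
    return sum(divergence_to_node(tree, t, ancestor) for t in tips) / len(tips)


# Hypothetical genealogy: "A_C_node" stands for the node joining the
# subtype C lineages to the subtype A outgroups.
tree = {
    "donor_1": ("donor_clade", 0.020),
    "donor_2": ("donor_clade", 0.030),
    "donor_clade": ("pair_node", 0.050),
    "recip_1": ("recip_clade", 0.005),
    "recip_2": ("recip_clade", 0.010),
    "recip_clade": ("pair_node", 0.040),
    "pair_node": ("A_C_node", 0.060),
}
donor_div = mean_divergence(tree, ["donor_1", "donor_2"], "A_C_node")
recipient_div = mean_divergence(tree, ["recip_1", "recip_2"], "A_C_node")
```

In this toy tree the recipient tips sit closer to the ancestral node than the donor tips, which is the pattern the comparison is designed to detect.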
Intrahost evolution. We estimated intrahost divergence of HIV-1 env C2 to V5 sequences over the course of infection for each of the nine MACS patients in two separate ways. First, we calculated mean pairwise distances between viral sequences at each time point and the consensus sequence from the initially infecting population (first virus-positive time point). Second, we calculated mean distance between viral sequences at each time point and the deduced MRCA sequence (10, 31) for a given patient. This ancestral sequence was derived using PAUP* (40) from the basal node on a tree of patient sequences from all time points, using an outgroup composed of a single first time point sequence (the sequence closest to the Los Alamos National Laboratory HIV Database subtype B consensus [19]) from each of the other patients. Sequence data and patient numbers have been previously described (36).
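The first of the two intrahost measures can be illustrated with a short sketch: a majority-rule consensus is taken from the first time point, and the mean uncorrected distance to that consensus is computed at each later time point. This is a minimal illustration only, assuming gap-free aligned sequences and uncorrected distances; the function names, sampling times, and sequences are hypothetical. The same distance function could be applied to a reconstructed MRCA sequence in place of the consensus.

```python
from collections import Counter


def consensus(seqs):
    """Majority-rule consensus; ties resolve to the residue seen first."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*seqs))


def p_distance(a, b):
    """Proportion of differing sites (assumes equal length, no gaps)."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def mean_distance_to_reference(seqs, reference):
    """Mean p-distance between each sequence and a reference sequence."""
    return sum(p_distance(s, reference) for s in seqs) / len(seqs)


# Made-up sequences keyed by years since the first virus-positive sample.
time_points = {
    0.0: ["ACGTACGT", "ACGTACGT", "ACGTACGA"],
    2.5: ["ACGAACGA", "ACGTACGA", "TCGAACGA"],
}
founder = consensus(time_points[0.0])
divergence_by_time = {
    t: mean_distance_to_reference(seqs, founder)
    for t, seqs in sorted(time_points.items())
}
```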
values of 0.112 and 0.137, respectively (Table 1) (P < 0.001, two-tailed Mann-Whitney U test). The Lyon primary infection time point data set, including 1993, 1994, and 1995 sequences, also contains less diversity than the MACS late time point data set, with an uncorrected π of 0.119 (Table 1). Thus, the duration of infection has a clear effect on observed env interhost diversity. In all data set replicates, the uncorrected π estimates are less than estimates corrected with various substitution models and, thus, can be considered conservative estimates that ignore multiple substitutions and rate heterogeneity across sites; maximum likelihood-corrected π estimates show an even greater discrepancy between primary infection and late infection time points: PIC, π = 0.157; MACS 1993, π = 0.210 (Table 1) (P < 0.001, Mann-Whitney U test); and Lyon, π = 0.162. Estimates of the neutral parameter θ are also less in primary infection than in later time point data sets: PIC, θW = 81.87; MACS, θW = 96.51 (Table 1) (P < 0.001, two-tailed Mann-Whitney U test); and Lyon, θW = 87.55.
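The comparisons between replicate data sets rely on the two-tailed Mann-Whitney U test. The sketch below shows how such a test can be computed, assuming a normal approximation without tie or continuity correction, which is only one of several ways to obtain the P value and may differ from the procedure used for Table 1. The replicate values are invented for illustration and are not the study data.

```python
import math


def mann_whitney_u(x, y):
    """Two-tailed Mann-Whitney U test with a normal approximation.

    Returns (U, p), where U is the smaller of the two U statistics.
    No tie correction or continuity correction is applied.
    """
    n1, n2 = len(x), len(y)
    # U for x: count pairs where x exceeds y, counting ties as one half.
    u_x = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in x for b in y)
    u = min(u_x, n1 * n2 - u_x)
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mean_u) / sd_u
    # Two-sided P value: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2)).
    p = math.erfc(abs(z) / math.sqrt(2))
    return u, p


# Invented replicate diversity estimates for two cohorts.
early = [0.110, 0.112, 0.113, 0.111, 0.114]
late = [0.135, 0.137, 0.139, 0.136, 0.138]
u_stat, p_value = mann_whitney_u(early, late)
```

For small samples such as these, an exact P value would usually be preferred over the normal approximation.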
Although the Lyon data set contains eight sequences while the PIC and MACS 1993 data sets contain nine, PIC and MACS data sets jackknifed down to eight sequences have no significant differences in diversity compared to full data sets. Diversity (in π and θ) is slightly higher in Lyon than in PIC (Table 1), likely due to later sampling of three of eight subjects in Lyon (two in 1994 and one in 1995), adding diversity as the epidemic continually expands (14).
To examine possible effects of intrahost divergence on interhost genealogic patterns, we reconstructed evolutionary relationships for each replicate data set. We used these genealogies to infer past and present HIV population dynamics using coalescent methods (11, 17). Classic skyline plots for all of the data sets we examined reflect in their shape an exponential growth rate of the epidemic, consistent with previous descriptions of HIV-1 subtype B. However, estimates of the effective number of infected individuals (Ne) and of the timing of the subtype B epidemic differ between MACS 1993 and PIC 1993 data sets (Fig. 1). Mean Neµ estimates are greater for MACS than for PIC, at 3.23 (variance, 0.10) and 2.42 (variance, 0.02), respectively (P < 0.001, two-tailed Mann-Whitney U test). Mean MRCA estimates are earlier for MACS than for PIC, at 0.15 (variance, 2.0 × 10⁻⁴) and 0.11 (variance, 2.5 × 10⁻⁵) substitutions per site, respectively (P < 0.001).
FIG. 1. Classic skyline plots inferred from comparisons of sequences between different subjects (interhost comparisons): MACS 1993 and PIC 1993 replicate genealogies (a) and MACS first, last, and random time point replicate genealogies (b). MACS 1993 sequences are derived from epidemiologically unrelated subjects in later stages of disease progression; PIC 1993 sequences are derived from other, epidemiologically unrelated subjects early in infection, within 6 months of seroconversion.
Divergence in transmission pairs. We measured divergence (mean of summed branch lengths) from an interhost MRCA, in this case the node rooting subtype C to the subtype A outgroups, in donor and recipient sequences. Divergence is significantly less in recipient than in donor sequences (P < 0.05, two-tailed Mann-Whitney U test) in five of seven transmission pairs (Table 2). This corroborates the finding of decreased interhost divergence in early time point env relative to env sampled at time points after primary infection. Variation in the extent, or existence, of this pattern in these seven transmission pairs may be due to variation in the duration of infection in donor subjects prior to transmission, variation in infection and sampling in recipients, or to host factors such as HLA type affecting selection within the recipients.
TABLE 2. Transmission pair divergence analysis
FIG. 2. Divergence of HIV-1 env C2 to V5 sequences over time within individual subjects (intrahost comparisons). Mean pairwise comparisons of viral sequence distances to the initially infecting population (first virus-positive time point) are shown by the points connected by a gray line, with values given on the common right-side y axis. The black lines and left-side y axes show the mean distance between viral sequences at each time point and the deduced MRCA sequence (10, 31) for each patient. This ancestral sequence was derived in the program PAUP* (40) from the basal node on a tree of patient sequences from all time points, defined by an outgroup composed of a single first time point sequence (the sequence closest to the Los Alamos National Laboratory HIV Database subtype B consensus [19]) from each of the other patients. Sequence data and patient numbers were described previously (36). The y axes on the left side were adjusted up or down to provide maximal alignment with plots using the right-side y axis.
FIG. 3. Phylogenetic analysis of viral sequences in patients 1, 3, 9, and 11 from the study of Shankarappa et al. (36). Viral sequences taken from the first two to four biannual visits following infection are shown at tree tips marked with filled symbols; those taken from the final two to four follow-up visits are shown with open symbols. The legend at the right of each patient's clade indicates the time, in years, the specimen was taken following the estimated time of seroconversion. GenBank locus names are given to subtype B HIV-1 sequences used as outgroups.
; HKY and maximum likelihood-corrected π show 3.2% and 5.3% differences, respectively). Thus, if we consider HIV-1 B phylogenies to be nearly star-like, in which most circulating strains radiate from some central point within the phylogeny (4, 34), we expect a portion of external substitutions in each lineage to be substitutions accumulated after infection. Furthermore, if HIV transmission events occur predominantly during early infection, given the high viremia in acute infection (6) and the positive correlation between transmission rates and viremia (12), these external substitutions may be unimportant to transmission and thus irrelevant in immunogen design. Therefore, for the purposes of including env in rational immunogen design, it may be advantageous to incorporate only early time point sequences. It is currently unclear whether other HIV genes undergo similar evolutionary processes in transmission and primary infection.
Coalescent analyses of HIV interhost sequences assume that the accumulation of substitutions within hosts will not affect estimates of Ne or the timing of the MRCA, because the mean sequence diversity and estimates of θ will remain the same. We have shown this not to be the case. Due to the coalescent relationship between genetic diversity, population size, and genealogy, such effects may cause erroneous descriptions of HIV epidemiologic dynamics and history. Our data sets are relatively small, and extrapolation to interhost coalescent analyses with larger numbers of sequences should be made with caution. However, the variance seen among the random time point data sets suggests that common HIV interhost data sets may result in underestimates of the variance associated with population genetic parameters.
In conclusion, we have demonstrated that time since infection can be a significant factor affecting the diversity and divergence observed in HIV interhost sequence data sets. Currently, interhost sequence data sets are not typically assembled with regard to the stage of disease progression within each sampled subject. This result has consequences for population genetic studies of HIV molecular epidemiology that rely on the assumption of continual interhost divergence from the interhost MRCA and certainly may affect the choice of sequences for phylogenetic reconstruction used in rational vaccine design.
We have also shown that the source of the discontinuity of evolutionary divergence in env is likely the evolution toward ancestral states that takes place upon transmission to a new host. Recent data reveal some of the selective mechanisms that are likely to account for these observations. On the one hand, a consistently paced forward evolution (36) most visibly results from the effect of mutations leading to immunologic escape from CTL responses (2, 23, 24). Opposing this trend is the reversion of CTL escape mutations upon transmission to a new host (2, 9, 23, 24). Escape mutants from restricted epitopes that arose within donor individuals having a given HLA type appear to revert to a susceptible epitope sequence in recipients with different HLA alleles that are unable to present these viral peptides to the immune system. Adaptation and concentration of escape mutants in HIV circulating in populations with common HLA alleles have also been postulated (16, 28), and these are likely to mute the effect we have shown here. Furthermore, changes in glycosylation patterns associated with escape from host antibody responses also occur early in infection (8). It is also known that drug resistance mutations in HIV-1 can impart decreased relative replication capacity in cell cultures lacking antiretroviral drugs (13) and that resistant forms are selectively lost in vivo following removal of the drug (6a, 8a). Thus, recovery of ancestral states may reflect restoration of fitness lost as a result of immunological escapes in the previous host, as well as replication-advantageous mutations within an HIV immunologically naïve host.
The convergence of viral sequences and specific mechanisms of reversion suggest, circumstantially, that there are strong sequence constraints that are important to viral reproduction across patients. This provides impetus to the development of vaccine immunogens that favor inclusion of viral sequences from early in infection and that embody ancestral or consensus features of viruses circulating in a given population (for a review, see reference 29).
This work was supported by grants from the U.S. Public Health Service, including a training fellowship (NIH T32AI07140) to J.H.