Previous Article | Next Article ![]()
Journal of Virology, February 2004, p. 1962-1970, Vol. 78, No. 4
0022-538X/04/$08.00+0 DOI: 10.1128/JVI.78.4.1962-1970.2004
Copyright © 2004, American Society for Microbiology. All Rights Reserved.
CEPM, UMR CNRS-IRD 9926, Montpellier, France,1 Department of Pathology, University of CaliforniaSan Diego, La Jolla, California 92093,2 School of Biological Sciences, University of Manchester, Manchester, United Kingdom3
Received 10 July 2003/ Accepted 28 October 2003
|
|
|---|
|
|
|---|
Within HIV-1 group M, nine major subtypes (A to D, F to H, J, and K) have been designated, as have 14 circulating recombinant forms (CRF01 to CRF14) (12, 24). Interestingly, recent studies have identified diversity within HIV-1 group O equivalent to that exhibited by group M (25, 33), despite the fact that almost all group O infections are restricted to Cameroon or to individuals with strong links to that region. Although there is phylogenetic substructure within group O phylogenies, distinct group M-like subtypes are not apparent (25). This is not too surprising, given that the prominence of the group M subtypes is strongly linked to founder events in the course of the HIV-AIDS pandemic that occurred outside the Democratic Republic Congo region (23). Analogous founder events have not occurred in the case of group O, as these types of infection have remained strongly associated with one geographic location, Cameroon. The third HIV-1 group, N, also remains restricted to Cameroonian residents, and to date only five infections have been conclusively documented (3).
The development of candidate vaccines specific to different HIV lineages (7) demands a thorough investigation of the consistency of the selective environment, which is presumed to be due primarily to the host immune responses (15, 22, 39) to divergent HIVs. Evidence for adaptive evolution has been found previously among HIV sequences from intra- and interpatient studies (4, 29, 30, 38, 40). Early studies involved the pairwise comparison of synonymous (silent, dS) and nonsynonymous (amino acid changing, dN) substitutions between protein-coding DNA sequences. The dN/dS ratio,
, was then used to measure the difference between these two rates of substitution such that an
value less than 1 corresponds to purifying (negative) selection, an
value of 1 corresponds to neutral evolution (absence of selection), and an
value greater than 1 indicates adaptive evolution (positive selection) (reviewed in reference 37). The pairwise approach to quantifying adaptive evolution assumes that all sites are prone to the same selective pressure, making such tests very conservative. In reality, positively selected sites normally occur in a background of negatively selected sites within a functional protein.
The problem of resolving positively selected sites against this background of negative selection has been solved in a maximum-likelihood (ML) and Bayesian statistical framework (for a review, see reference 37). First, the ML method determines whether positive selection is present by evaluating a series of models with or without a class of positively selected sites. Second, if the favored model includes positive selection, a Bayesian analysis assigns each amino acid site a "posterior probability" of being conserved, neutral, or positively selected.
Here, we focus on positively selected sites that were inferred by using the codon-based method (38), and we determine the extent to which their locations and the intensity of their selection overlap among different HIV lineages. We first quantified positive selection in the major HIV genes (gag, pol, and env) for the three HIV-1 group M subtypes (A, B, and C) and for HIV-2 subtype A. Since env exhibited the strongest signal for positive selection, the location of sites in env with a high probability of being under positive selection was compared across different HIV data sets corresponding to sequence alignments of HIV-1 group M subtypes A through D, group O, and an HIV-2 subtype. The hypothesis that phylogenetically divergent HIV lineages are subject to similar selective pressures was tested by determining whether the occurrence of positively selected sites at the same locations was statistically significant and whether the strength of selection was similar. On the assumption that sites are positively selected primarily as a consequence of pressure from the immune system (15, 22, 39), our results have some interesting consequences for vaccine design, as they suggest the possibility of cross-subtype and -group immunogenicity. We investigated whether the immune response, as represented by experimentally defined epitopes or the positions of N and O glycosylation (13, 28), could account for the observed distribution of the positively selected sites. We found a significant tendency for positively selected sites to fall outside T-helper epitope regions and for positively selected sites to be strongly associated with N glycosylation sites.
|
|
|---|
|
View this table: [in a new window] |
TABLE 1. Data sets used in this studya
|
model of nucleotide substitution with optimal values for the TS/TV rate ratio and the shape parameter (
) of a gamma distribution (with eight categories) of rate variation among sites, both determined during tree construction. The ML method of Yang and coworkers (38) utilized codon-based models that incorporate statistical distributions to account for variable
ratios among codons. Efficient determination of sites under positive selection requires implementation of only six models of codon substitution (M0, M1, M2, M3, M7, and M8) out of the original 14 models (for further details, see reference 38 and http://www.bioinf.man.ac.uk/
robertson/supplementary-material [appendix A]). Briefly, null models M0, M1, and M7 do not allow for the existence of positively selected sites because
ratios are fixed or estimated between the bounds 0 and 1, whereas models M2, M3, and M8 account for positive selection by using parameters that estimate
to be greater than 1. The significance of positive selection can be confirmed with a likelihood ratio test (LRT) between null models and those able to account for positive selection. An LRT is performed by taking twice the difference in log likelihood between nested models and comparing the result to a
2 distribution with degrees of freedom equivalent to the difference in the number of parameters between the models. Models M0 and M1 are both nested with M2 and M3, M2 is nested with M3, and M7 is nested with M8. All the model comparisons (M0 versus M2, M1 versus M2, M0 versus M3, M1 versus M3, M2 versus M3, and M7 versus M8) gave similar results, and for the sake of simplicity we focus on the results of models M7 and M8. M7 uses a discrete (10 classes) beta distribution to model sites with
ratios between the bounds 0 and 1. For each class i (1
i
10) of the beta distribution, the value of the
i ratio and the proportion (pi) of sites belonging to this class are estimated by maximizing the likelihood. M8 adds two additional parameters to model M7 such that p11 can account for a positively selected class of sites where
11 is not constrained by the beta distribution and is allowed to be greater than 1. Once positively selected sites have been shown to exist, i.e., if model M7 is rejected in favor of M8 by the LRT, a Bayesian approach (for which the p1 to p11 values are used as a prior distribution) is used to infer the posterior probability that site i belongs to one of the 11
classes:
. Models were implemented using the CODEML program of the PAML package, version 3.1 (36).
Statistical analysis of sites identified as positively selected.
A "shared-position" statistic and Monte Carlo simulations were used to test whether putative positively selected sites (defined as those having a p11 value of greater than 0.95 when
11 is greater than 1 for model M8) tend to occur at the same positions in data sets I to N (H1) more often than would be expected by chance (H0). The shared-position statistic used is the count of the match between the positions of positively selected sites in one data set and the positions of positively selected sites in another data set. As this test depends on the quality of the alignment among the diverse data sets, the result should be conservative.
To study the "strength" of positive selection, we defined for each site, i, the weighted mean
value as
as previously implemented (7). For each pair of data sets, we tested whether the strength of positive selection was significantly different (H1), as opposed to being equivalent (H0), by using a paired Wilcoxon rank sum test with a continuity correction applied to the normal approximation for the P values (26). Only shared sites having a weighted mean
value greater than 1 in the two data sets being compared were included. Note that the positively selected sites with a weighted
value greater than 1 are not necessarily identified as positively selected by model M8 at the 95% level. The latter sites identified at the 95% level by M8 will be a subset of the former weighted sites. The paired Wilcoxon rank sum test was repeated only for those shared sites identified by M8 at the 95% level.
Finally, Monte Carlo simulations were again used to test a null hypothesis (H0) that sites of positive selection are not associated with the positions of epitope regions, or sites of glycosylation, against the alternative hypotheses (H1) that the positively selected sites are associated with the location of the epitope regions (or various combinations of the three types of regions) or the positions of the glycosylation sites in the different data sets. An additional hypothesis (H2) that the positive selected sites tend to fall outside the defined epitope regions (or various combinations of the three types of regions) was also tested against H0. The epitope regions are experimentally defined and correspond to antibody (Ab), cytotoxic T-cell (CTL), and helper T-cell immune response data available from the Los Alamos National Laboratory HIV Immunology Database (11). As the majority of epitope mapping has focused on subtype B-infected individuals (11), only the positively selected sites identified in data set J were tested. For each data set, the positions of the N and O glycosylation sites were predicted using the NetNGlyc (R. Gupta, E. Jung, and S. Brunak, unpublished data) and NetOGlyc (9) programs, respectively.
For all Monte Carlo simulations, 9,999 repetitions proved to be enough to reach an asymptotic state. The programs used to implement the Monte Carlo simulations are available upon request from M. Choisy.
|
|
|---|
values for gag, pol, and env.
The results for the mean
values (assuming the same value for
at all sites) for the genes gag, pol, and env, and for the individual subunits of env (gp120 and gp41), are shown for HIV-1 group M subtypes A, B, and C and for HIV-2 subtype A in Fig. 1. Except for the group M subtype A, B, and C results for gp120 and subtype B for gp41, all
values are less than 1, indicating that the majority of sites are subject to purifying selection. The effect of purifying selection is particularly strong in the gag and pol genes but is much weaker in the envelope region, which is not surprising given that env codes for the envelope surface proteins, which are the most exposed to the immune system. Note that despite the low mean
values in the gag and pol genes, positive selection can still occur at a minority of sites, but this signal can be averaged out by M0 and pairwise methods. For example, others have previously found a comparable
value (0.196) for the pol gene of a subtype B alignment as well as strong evidence for adaptive evolution (38). The contrast in mean
ratios between gag and pol compared to that of the env regions indicates that the env region contains more positively selected sites than do the other genes. Within the env region, positive selection appears to be particularly strongly associated with the gp120 subunit, coding for the extramembrane envelope protein.
![]() View larger version (37K): [in a new window] |
FIG. 1. Mean ratios in gag, pol, env, env-gp120 and env-gp41 for HIV-1 group M subtypes A, B, and C, and HIV-2 subtype A (data sets A to K and N to V in Table 1). The mean ratios are calculated by averaging the results over all of the sites and are obtained from model M0. The numbers above the bars indicate the number of sequences and the number of codons in each data set. For example, "11/404" above the first gag bar indicates that there were 11 sequences and 404 codons in the gag HIV-1 group M subtype A data set (called data set A in Table 1).
|
> 1) and rejected those models that were unable to account for positive selection (M0, M1, and M7). For the sake of clarity, only results for M8 are presented in Table 2 (results for the other models are available from http://www.bioinf.man.ac.uk/
robertson/supplementary-material (appendixes B and C). M2 and M8 identified the same positively selected sites when taking posterior probabilities greater than the 95% level using the Bayesian approach. M3 identified all of the sites identified by M2 and M8 and several more. We consider the sites identified by M8 only, as M3 has the potential to overestimate the number of positively selected sites (2, 38). The number of positively selected sites identified by model M8 was 22 for HIV-2 subtype A, between 30 and 35 for HIV-1 M subtypes, and 40 for HIV-1 O (Table 2). Figure 2 shows the location of these putative positively selected sites across the multiple alignment of data sets I to N. Positively selected sites are not restricted to the variable regions (V1 to V5) of env, a finding that supports previous work that used a maximum-parsimony-based method to identify amino acid sites that were potentially under the influence of positive selection in an HIV-1 subtype B alignment of sequences (35). |
View this table: [in a new window] |
TABLE 2. Positive selection in the env genea
|
![]() View larger version (74K): [in a new window] |
FIG. 2. Positions of positively selected sites across env for HIV-1 group M subtypes A, B, C, and D; group O; and HIV-2 subtype A (data sets I to N in Table 1). Each data set analyzed by CODEML is represented by one sequence, with the sites included in the analysis indicated with boldface type. Sites identified as being positively selected with a posterior probability of more than 95% are shaded. Notations above the sequences divide env into the gp120 and gp41 subunits and show the position of vpu and the second rev exon with the beginning and end of regions. The positions of the five variable regions V1 to V5 are indicated. Sites critical for CD4 binding are identified by a vertical bar, and sites implicated in the CXCR4 to CCR5 receptor switch are indicated with an * above the sequences. The numbers 2, 3, and 4 below the sequences indicate the number of data sets for which positive selection was identified at that site. The representative sequences for HIV-1 group M subtypes A, B, C, and D; group O; and HIV-2 subtype A are MA246, MBC18, BU910112, 84ZR085, ANT70, and CBL21, respectively.
|
|
View this table: [in a new window] |
TABLE 3. Monte Carlo simulations testing the association of sites of positive selection between data setsa
|
ratio for each site (see Materials and Methods) that had a value greater than 1 (Fig. 3). When sites that had a weighted
value greater than 1 were tested among different data sets with a paired Wilcoxon ranked sum test (Table 4), the comparison was significant for the strength of selection differing between HIV-1 subtypes B and C, a result that is in agreement with that of Gaschen and coworkers (7). This was also the case for the comparison between HIV-1 subtypes A and B and between HIV-1 group O and HIV-2 subtype A. However, no other comparisons were significant, indicating that generalizations about differences in the strength of selection between diverse HIV data sets should not be made based on the available data. For the subset of sites with a weighted
value greater than 1 and that were identified as positively selected by model M8 at the 95% level, the comparisons between HIV-1 subtypes A and B, B and C, and between HIV-1 group O and HIV-2 were significant (P = 0.0001, 0.0282, and 0.0156, respectively; data available at http://www.bioinf.man.ac.uk/
robertson/supplementary-material [appendix D]).
![]() View larger version (23K): [in a new window] |
FIG. 3. The weighted mean ratio greater than 1 at each codon position in the env data sets comparing HIV-1 group M subtypes A, B, C, and D; group O; and HIV-2 subtype A (data sets I to N in Table 1). The weighted mean value for each site is calculated by multiplying by the posterior probability for each class under M8 and summing the results (see Materials and Methods). The positions of the five variable regions V1 to V5 are indicated.
|
|
View this table: [in a new window] |
TABLE 4. Paired Wilcoxon ranked sum test to determine differences in the strength of positive selection between data setsa
|
|
View this table: [in a new window] |
TABLE 5. Correlation of positively selected sites with epitopes in the env gene of HIV-1 group M subtype B (data set J)a
|
0.05) were found for the test of the association of O glycosylation sites identified in each HIV-1 data set with the putative positively selected sites (results not shown). |
View this table: [in a new window] |
TABLE 6. Association of positively selected sites with sites of N glycosylation in env
|
|
|
|---|
Further analysis of env revealed that a proportion of the sites that were identified as positively selected (ranging from 25 in the HIV-1 M group subtype A data set to 40 in the HIV-1 group O data set) were at the same positions in the different data sets (Fig. 2). We believe that only the immune response could be driving this propensity of HIV to exhibit adaptive molecular evolution to such an extent. Furthermore, for the HIV-1 group M comparisons, between 10 and 16 sites were shared depending on the subtypes compared (Table 3). On the assumption that the immune response provides the evolutionary pressure for amino acid change at these sites (15, 22, 39), the finding that positively selected sites are shared between divergent HIV-1 lineages suggests that the immune response may be targeting the same viral regions in the different groups and subtypes, thus raising the possibility of cross-subtype or -group immunogenicity.
However, it has been reported previously (7) that the strength of selection at positively selected sites in the C2V3 region of group M subtypes B and C is different. We made the same observation here, not only for the C2V3 region but also for the entire env gene, and we moreover show that the finding is statistically significant (Table 4). Importantly, we found that (i) there is a tendency for the position of positively selected sites to be correlated among different HIV-1 data sets and that (ii) there can be, at the same time, a difference in the strength of selection for some comparisons; these findings are not mutually exclusive. This is true because selection may be acting at the same sites but to differing extents, or, alternatively, the different strengths of selection might be predominantly at the sites that are not correlated among the different data sets. Nevertheless, as most comparisons of the strength of selection are not significant, generalizations about differences among diverse HIV data sets should not be made based on the available data. Indeed, differences in the strength of selection may be due to other factors such as the predominant form of transmission in a given subtype or the amount of diversity in that subtype.
To explicitly test the assumption that the majority of adaptive evolution observed in the HIV envelope is due to the immune response, we then investigated the potential of the different types of immune response (Ab, CTL, and T helper) to account for the location of the positively selected sites. This comparison is relatively crude because the epitopes that can be recognized in different individuals and their frequencies in different populations will vary. Ideally, viral sequences would be analyzed that are from a distinct population in conjunction with information concerning the types of epitope that could be recognized, as has been done for comparisons between HLA haplotypes and polymorphisms present in viral sequences (15). Also, some of the positively selected sites may be relevant to nonlinear epitopes, which are difficult to detect since they are formed by protein tertiary structure bringing distant sites into proximity. For example, a "glycan shield" model (28) has been proposed, which suggests that linear epitopes in regions essential for viral fitness and that are unable to tolerate mutation can be protected from neutralizing antibodies by the bound carbohydrates. Mutations of gp120 would cause the permanent rearrangement of the carbohydrates, thus creating a moving protective shield around epitopes that are unable to tolerate mutation, as they are in functionally conserved regions. Despite these limitations, a significant result was found (Table 5) for the tendency of the T-helper and possibly for the CTL epitope regions to not include positively selected sites. This result might be explained by a finding that CTL epitopes are more concentrated in relatively conserved regions across the HIV genome, whereas positively selected sites will have a tendency to be detected in the more variable regions (39). Alternatively, positively selected sites may correspond to proteolytic cleavage sites such that mutation in the epitope-flanking residues alters intracellular processing, thereby permitting CTL escape (39).
We also investigated the predicted positions of the N glycosylation sites with respect to the positions of the identified positively selected sites. The significance of N glycosylation sites is that they allow the binding of carbohydrates to the viral envelope to mask viral protein epitopes from the immune response (5, 19, 28, 31). The bound carbohydrates are large molecules contributing to half of the molecular mass of gp120 (5) and that are linked to the gp120 protein on N glycosylation sites, and, to a lesser extent, sites of O glycosylation. They are thought to play an important role for the stability of the gp120 molecule (19), for CCR5 and CXCR4 coreceptor utilization (21), and for escape from the immune defense (5, 10). The relatively rapid turnover of mutations of the gp120 protein may also induce continual conformational changes, and such a constantly moving structure may help to distort epitopes and prevent antibody binding (13). In accordance with previous reports (5), between 22 and 39 N glycosylation sites were predicted (Table 6) for the different HIV-1 data sets. When the N glycosylation sites present in the HIV-1 data sets were considered, Monte Carlo simulations indicated that these sites are significantly associated with putative positively selected sites. These findings of a correlation among positively selected sites but not of the location of Ab epitope regions is consistent with the glycan shield model of viral escape (28). Interestingly, no N glycosylation sites were detected in the HIV-2 data set, despite glycosylation for HIV-2 being previously reported (14). This finding seems to reflect the very low number of such sites in HIV-2 strains. HIV-2 is apparently less virulent than HIV-1 (8), possibly due to HIV-2 being less antigenic, thus accounting for the lack of N glycosylation sites.
In our opinion, potential correlations of adaptive molecular evolution among divergent HIVs, such as those we have detected here, warrant further investigation, as they are indicative of possible shared antigenicity. Given the unquestionable need for an HIV-AIDS vaccine to elicit an immune response against multiple group M subtypes, specifically in Africa where multiple subtypes frequently cocirculate, a hypothetical vaccine cocktail that would include antigens from a number of genomic regions is clearly worth investigating. The present preoccupation with consensus and ancestral sequences targeted at an individual subtype as optimal immunogens (for an example, see references 7 and 16) make limited or no specific attempts to elicit immune responses that may be cross-reactive to different subtypes. In addition, the most antigenic viral regions will be embedded in sequence of differing immunogenic potential. Thus, there is a possibility that constructing a consensus sequence from the genetic material of circulating viruses would result in the least optimal antigenic regions being included in the vaccine. This result would occur because the consensus sequence would represent optimal genetic material for immune escape because it would have sequences from multiple viruses that have successfully escaped the immune response. Furthermore, neither a consensus sequence nor a reconstructed ancestral sequence (due to ongoing recombination within individuals [with or without superinfection] and positive selection resulting in escape mutants with the same convergent amino acid changes) can represent any virus that has ever existed and so may lack important properties that could be of immunogenic importance in a potential vaccine (for example, folding). In conclusion, if we are to control HIV, we must understand its evolution and conceive appropriate intervention strategies accordingly.
We also thank the Wellcome Trust (which provided assistance through their Biodiversity program while D.L.R. was at the Department of Zoology, University of Oxford, where this work was begun), the National Institutes of Health (AIDS training grant number AI07384), and the CNRS for funding (M.C. is supported by a Bourse Docteur Ingénieur from the CNRS-Région Languedoc Roussillon).
|
|
|---|
This article has been cited by other articles:
| |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Copyright © 2009 by the American Society for Microbiology. For an alternate route to Journals.ASM.org, visit: http://intl-journals.asm.org | More Info»