Journal of Virology, June 2002, p. 5435-5451, Vol. 76, No. 11
0022-538X/02/$04.00+0 DOI: 10.1128/JVI.76.11.5435-5451.2002
Copyright © 2002, American Society for Microbiology. All Rights Reserved.
Harvard School of Public Health,1 Harvard Medical School, Boston, Massachusetts,7 Los Alamos National Laboratory, Los Alamos, New Mexico,2 University of Washington, Seattle, Washington,3 University of Cape Town, Cape Town, South Africa,4 Botswana-Harvard Partnership for HIV Research and Education,5 National Health Laboratory/National Blood Transfusion Center, Gaborone, Botswana6
Received 25 October 2001/ Accepted 13 February 2002
| ABSTRACT |
|---|
NF-κB sites (GGGRNNYYCC) were identified within the consensus sequences of the entire set or any subset of HIV-1C isolates. This study suggests that the consensus sequence approach could overcome the high genetic diversity of HIV-1C and facilitate AIDS vaccine design, particularly if the assumption that an HIV-1C antigen with a more extensive match to the circulating viruses is likely to be more efficacious is proven in efficacy trials.

| INTRODUCTION |
|---|
The divergent patterns of the AIDS epidemic in different geographic areas may be an important consideration for the design and testing of HIV vaccines. The monophyletic HIV-1C epidemic in southern Africa contrasts with the multisubtype epidemic in other regions of sub-Saharan Africa (reviewed in references 19, 31, and 43). Both cross-clade immunity (6, 10, 17, 30, 53) and subtype-specific immune responses have been reported (11, 15, 45). This indicates that the relative importance of cross-reactive versus clade-specific immunity that might be elicited by a protective vaccine remains unknown.
Several studies that analyze the full-length HIV-1C genome have been recently reported (21, 28, 33, 36, 37, 39, 41, 44, 46). The 27 nonrecombinant near-full-length HIV-1C isolates that were sequenced and analyzed earlier (21, 28, 37, 41, 44, 46) represented viruses from Botswana (9 isolates), India (9 isolates), Tanzania (2 isolates), Zambia (2 isolates), Brazil (2 isolates), Ethiopia (1 isolate), South Africa (1 isolate), and Israel (1 isolate). An additional set of four near-full-length genome sequences from South Africa (52) was used in this study. Another new set of five near-full-length genome sequences from Ethiopia was reported recently (24), although the sequences were not available in the public domain and were not included in this study. By performing analyses of phylogenetic patterns within available near-full-length genome nonrecombinant HIV-1C, we addressed (i) the phylogenetic relationship of HIV-1C viruses, including genetic diversity within and between geographically distinct subsets, and (ii) analysis of the HIV-1C consensus sequence as reference information for HIV-1C vaccine studies in southern Africa and, particularly, in Botswana.
HIV-1 diversity has been considered a major problem for the development of a vaccine. A relatively high level of HIV-1C intrasubtype diversity in southern Africa (12, 41, 52) increases the magnitude of the challenge. Choosing the "right" vaccine candidate by employing a particular viral strain, clone, or isolate is still the current approach to HIV vaccine design. A homologous vaccine might be an ideal one, although in the case of HIV-1C infection it is a rather unrealistic goal. If a higher homology between vaccine and circulating strain(s) results in a more efficacious vaccine, then a consensus sequence approach to AIDS vaccine design might lead to a better vaccine. In this study we address (i) whether the genetic diversity of HIV-1C might be overcome by using a consensus sequence as a vaccine candidate instead of any particular viral isolate and (ii) what the extent of potential vaccine coverage of circulating viral variants would be.
The most common presentation of the consensus is a one-string consensus sequence showing the nucleotides or amino acids that are the most frequent (e.g., the HIV consensus sequences in reference 9) or that occur in more than 50% of cases at a particular position of the alignment, with uncertainties (e.g., question marks or "X") otherwise. However, such a one-string consensus sequence does not include any diversity information (except for uncertainties) and might not be very informative in the case of high viral diversity. Moreover, minor residues with high frequencies (e.g., 20 to 49%) might be excluded from the simple one-string consensus. In this study we present an extended version of the HIV-1C consensus sequence that addresses amino acid diversity by combining the consensus sequence with the amino acid frequency data across the HIV-1C proteins. An extended consensus of the HIV-1C proteins should provide valuable information for vaccine design and for the potential monitoring of epitope immunity in vaccine efficacy trials (22).
| MATERIALS AND METHODS |
|---|
9-kb amplicon was obtained by one round of long-range PCR using the 696-9690 primer set (14). After a gel extraction of the amplicon, cloning was performed using a TOPO XL PCR cloning kit (Invitrogen, Carlsbad, Calif.). DNA plasmid purification and both-strand sequencing were performed as described previously (38, 40, 41).

Multiple alignment procedures. The alignment procedure applied in the phylogenetic study involved the application of ClustalX (version 1.81) (50) followed by manual alignment editing using BioEdit (23). Pairwise alignment parameters were set to the dynamic "slow-accurate" programming, using 10 as the gap opening penalty and 0.1 as the gap extension penalty. Multiple alignment parameters included a gap extension penalty equal to 0.2.
Phylogenetic distances. The pairwise evolutionary distances of the nucleotide alignment were computed by DNADist with the Kimura two-parameter model (PHYLIP: phylogeny inference package [versions 3.52c and 3.572c]; University of Washington, Seattle). Pairwise distances between translated amino acid alignments were computed by PROTDist with the PAM model (PHYLIP: phylogeny inference package [versions 3.52c and 3.572c]). By using sequence distances that weighted positions in the viral genome equally, our method implicitly assumed that all positions were equally important in determining vaccine protection. If information were available on the relative structural and immunological importance of positions, then it could be incorporated into the analysis.
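For readers unfamiliar with the Kimura two-parameter model, the textbook closed-form estimate can be sketched as follows. This is an illustration only, not the DNADist implementation: the function name `k2p_distance` is hypothetical, positions containing gaps or ambiguity codes are simply skipped (an assumption), and PHYLIP may arrive at its estimate differently (for example, under a fixed transition/transversion ratio).

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def k2p_distance(seq1, seq2):
    """Closed-form Kimura two-parameter distance for two aligned sequences.

    Hypothetical helper for illustration; gap/ambiguous positions are skipped.
    """
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    transitions = transversions = compared = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue  # assumption: ignore gaps and ambiguity codes
        compared += 1
        if a == b:
            continue
        pair = {a, b}
        if pair <= PURINES or pair <= PYRIMIDINES:
            transitions += 1      # A<->G or C<->T
        else:
            transversions += 1    # purine <-> pyrimidine
    if compared == 0:
        raise ValueError("no comparable positions")
    p = transitions / compared
    q = transversions / compared
    term1 = 1.0 - 2.0 * p - q
    term2 = 1.0 - 2.0 * q
    if term1 <= 0 or term2 <= 0:
        raise ValueError("sequences too divergent for the K2P formula")
    # d = -1/2 ln(1 - 2P - Q) - 1/4 ln(1 - 2Q)
    return -0.5 * math.log(term1) - 0.25 * math.log(term2)
```

Because every compared position contributes equally to P and Q, the sketch shares the equal-weighting assumption noted in the paragraph above.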
Consensus sequences. Consensus nucleotide sequences were obtained using BioEdit (23). Gaps were treated as a fifth residue. The threshold frequency for inclusion of a residue in a consensus sequence was 51%. Sites where no residue exceeded the threshold were scored as missing. Consensus amino acid sequences were obtained by translating nucleotide sequences, realigning codons using ClustalX (version 1.81) (50) followed by manual editing using BioEdit (23), and then computing the consensus sequences using Consensus (version 3; available from M. Essex). In Fig. 1 to 3 and 5 the consensus sequence 73C (or 73C_cons) is based on all 73 HIV-1 subtype C sequences, 51BW (or 51BW_cons) is based on 51 sequences from Botswana, 22nonBW (or 22nonBW_cons) is based on 22 non-Botswana sequences, 9IN (or 9IN_cons) is based on 9 sequences from India, and 5ZA (or 5ZA_cons) is based on 5 sequences from South Africa.
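The thresholding rule described above can be sketched in a few lines. This is a minimal sketch under stated assumptions, not the BioEdit or Consensus program: the function name `threshold_consensus` is hypothetical, "?" is used as the missing symbol, and a residue is accepted when its frequency is at least 51% (whether the boundary itself is inclusive is an assumption).

```python
from collections import Counter


def threshold_consensus(alignment, threshold=0.51, missing="?"):
    """One-string consensus of equal-length aligned sequences.

    Gaps ("-") are counted like any other residue (a "fifth residue").
    Columns where no residue reaches `threshold` are scored as `missing`.
    """
    if not alignment:
        raise ValueError("alignment is empty")
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all sequences must have the same aligned length")
    consensus = []
    for column in zip(*alignment):
        residue, count = Counter(column).most_common(1)[0]
        frequency = count / len(column)
        consensus.append(residue if frequency >= threshold else missing)
    return "".join(consensus)


# Toy alignment (made-up data): column 2 has no residue at >= 51%.
print(threshold_consensus(["ACG-", "ACGT", "ATG-", "AAT-"]))
```

Applied to a subset of rows, the same function would yield subset consensuses analogous to 51BW_cons or 9IN_cons.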
Statistical analysis. Statistical analysis and basic graphical delineation were done using Microsoft Excel 2000 software (Microsoft Corp.), Splus (version 6.0; Insightful Corporation, Seattle, Wash.), and SigmaPlot 2001 (SPSS Inc.). Additional graphical presentation was performed using Adobe Illustrator software (version 8.0). Two-sample t tests were used to compare the mean values of distances to the consensus sequence between sets of sequences. Corrected chi-square tests were used to compare the frequency of sequences with three NF-κB sites between sets of sequences. To assess if mean diversity between samples was different from the mean distance to consensus, we developed a new test statistic that appropriately accounted for the nonindependence of pairwise distances. For a set of n sequences, the new statistic is defined as the difference in between-sample and to-consensus-sample means, $\bar{x}_b - \bar{x}_c$, divided by the square root of the sum of the squared standard error of $\bar{x}_b$ and the squared standard error of $\bar{x}_c$. The standard error of $\bar{x}_b$ cannot be calculated in the standard way, because many of the $m = n(n - 1)/2$ pairwise distances are not independent. To compute it, we first noted that, by a direct calculation, c of the $m(m - 1)$ pairs of pairwise distances, where $c = m(m - 1) - m(n - 2)(n - 3)/2$, have a positive correlation due to a common sequence in the two distances. We assumed a common linear correlation and estimated it by Pearson's correlation, r, based on all possible pairs of pairwise distances with a shared sequence. Then, the standard error of $\bar{x}_b$ can be shown to equal the square root of $[(1/m) + (c \times r/m^2)]$ multiplied by the standard deviation calculated using all pairwise distances. We compared the resulting Z statistic to a standard normal distribution to test the null hypothesis. For nucleotide distances, we calculated r to be 0.432, 0.350, 0.486, 0.696, and 0.869 for the 73 HIV-1C, 51BW, 22nonBW, 9IN, and 5ZA sequence sets, respectively. For protein distances, we calculated r to be 0.319, 0.299, 0.304, 0.413, 0.289, 0.420, 0.286, and 0.134 for the 72 or 73 HIV-1C Pol, Vif, Vpr, Tat, Rev, Vpu, Env, and Nef sequences, respectively.
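The Z statistic described above can be sketched as follows. This is an illustrative reading of the text, not the authors' code: the function names are hypothetical, the standard error of the to-consensus mean is taken as the usual sd/sqrt(n) (an assumption, since the text only says the between-sample standard error needs special treatment), and the brute-force enumeration of pairs is written for clarity rather than speed.

```python
import math
from itertools import combinations, permutations


def sample_sd(values):
    """Sample standard deviation (n - 1 denominator)."""
    mean = sum(values) / len(values)
    return math.sqrt(sum((v - mean) ** 2 for v in values) / (len(values) - 1))


def pearson_r(xs, ys):
    """Pearson correlation of two equal-length lists."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def z_between_vs_to_consensus(dist, to_cons):
    """Z test of mean pairwise distance vs. mean distance to consensus.

    dist: symmetric n x n matrix of pairwise distances.
    to_cons: list of n distances from each sequence to the consensus.
    """
    n = len(dist)
    pairs = list(combinations(range(n), 2))
    m = len(pairs)                                   # m = n(n - 1)/2
    d = [dist[i][j] for i, j in pairs]
    # Ordered pairs of pairwise distances that share one sequence.
    xs, ys = [], []
    for a, b in permutations(range(m), 2):
        if set(pairs[a]) & set(pairs[b]):
            xs.append(d[a])
            ys.append(d[b])
    c = m * (m - 1) - m * (n - 2) * (n - 3) // 2     # equals len(xs)
    r = pearson_r(xs, ys)
    factor = 1.0 / m + c * r / m ** 2
    if factor < 0:
        raise ValueError("negative variance factor; formula not applicable")
    se_b = sample_sd(d) * math.sqrt(factor)
    se_c = sample_sd(to_cons) / math.sqrt(len(to_cons))  # assumption
    z = (sum(d) / m - sum(to_cons) / len(to_cons)) / math.sqrt(se_b ** 2 + se_c ** 2)
    p_two_sided = math.erfc(abs(z) / math.sqrt(2))
    return {"m": m, "c": c, "n_shared": len(xs), "r": r, "z": z, "p": p_two_sided}
```

The count c simplifies to 2m(n - 2): each of the m pairwise distances shares a sequence with 2(n - 2) others, which the enumeration in the sketch reproduces.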
Distribution of distance to consensus. If the consensus sequence from a sample of n sequences (one per individual) in a geographic region is used in a vaccine targeted to the region, then it is of interest to assess the distribution of the distance from a randomly sampled sequence to the observed consensus sequence. The mean distance of sequences in the population to the observed consensus was estimated by the average distance of the samples to the consensus sequence. Using the 73 HIV-1C sequences, a 95% confidence interval (CI) for the mean distance was calculated under the assumption that the distances are an independent, identically distributed random sample from a normal distribution. In addition, the 80th and 95th percentiles of the distance to the observed consensus were estimated with the sample percentiles, and the nonparametric bootstrap percentile method was used to compute 95% CI for the percentiles.
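The nonparametric bootstrap percentile interval mentioned above can be sketched as below. This is a generic sketch, not the authors' Splus code: the function names, the number of resamples, the fixed seed, and the linear-interpolation definition of a sample percentile are all assumptions, and the example data are made up.

```python
import random


def percentile(values, q):
    """Sample percentile (q in [0, 100]) with linear interpolation."""
    ordered = sorted(values)
    position = (len(ordered) - 1) * q / 100.0
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (position - lower)


def bootstrap_percentile_ci(values, statistic, n_boot=2000, alpha=0.05, seed=0):
    """Bootstrap percentile CI: resample with replacement, take tail percentiles."""
    rng = random.Random(seed)
    n = len(values)
    replicates = [
        statistic([rng.choice(values) for _ in range(n)]) for _ in range(n_boot)
    ]
    return (
        percentile(replicates, 100.0 * alpha / 2.0),
        percentile(replicates, 100.0 * (1.0 - alpha / 2.0)),
    )


# Made-up distances to consensus (%), for illustration only.
distances = [4.1, 4.5, 4.8, 4.9, 5.0, 5.2, 5.4, 5.6, 5.8, 6.7]
low, high = bootstrap_percentile_ci(distances, lambda v: percentile(v, 80))
print(f"80th percentile: {percentile(distances, 80):.2f}, 95% CI {low:.2f} to {high:.2f}")
```

Passing a different `statistic` (for example, the 95th percentile or the mean) reuses the same resampling loop.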
Proximity of a new sequence to the consensus. The probability that a randomly selected sequence in a population is within a certain threshold, D%, of the observed consensus sequence was estimated by the sample proportion of sequences within the threshold. A 95% CI for the probability was calculated using the normality approximation to a binomial random variable.
Extent of the consensus change. We considered the extent of anticipated change in the observed consensus sequence that was built on n = 73 HIV-1C sequences if it is rebuilt including a new sequence. We addressed this by evaluating how much the distribution of the distance of a randomly sampled virus to the observed consensus sequence changes in response to the introduction of a new sequence. The mean amount by which the mean distance to the observed consensus changes was estimated by the sample mean of the n numbers $d_i = |\bar{x}_{[-i]} - \bar{x}|$, $i = 1, \ldots, n$, where $\bar{x}_{[-i]}$ is the sample mean distance to the observed consensus, which was calculated using the n sequences with the ith sequence removed, and $\bar{x}$ is the sample mean distance to consensus for the full data set. A 95% CI for the mean change was calculated using the nonparametric bootstrap percentile method. In addition, the nonparametric bootstrap method was used to estimate 95% CIs for the mean amount by which the 80th and 95th percentiles of the distance to the observed consensus changes.
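The leave-one-out quantity defined above is simple to compute directly. The sketch below assumes the input is the list of distances from each sequence to the observed consensus; the function name `leave_one_out_mean_change` and the toy numbers are hypothetical.

```python
def leave_one_out_mean_change(distances):
    """Return (mean of d_i, list of d_i), where d_i = |mean without i - full mean|."""
    n = len(distances)
    if n < 2:
        raise ValueError("need at least two distances")
    total = sum(distances)
    full_mean = total / n
    changes = [abs((total - x) / (n - 1) - full_mean) for x in distances]
    return sum(changes) / n, changes


# Toy example: the full mean is 3.0, and dropping the outlier 6 moves it to 2.0.
mean_change, per_sequence = leave_one_out_mean_change([1, 2, 3, 6])
print(mean_change, per_sequence)
```

Each d_i shrinks roughly in proportion to 1/(n - 1), so with 73 sequences a single removal can only shift the mean slightly.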
All tests were two tailed, and a cutoff P of 0.05 was used to judge statistical significance.
Accession numbers. The 42 new, nonrecombinant HIV-1C nucleotide sequences from Botswana were deposited in GenBank under accession numbers AF443074 to AF443115.
| RESULTS |
|---|
NF-κB site seen in the majority of the isolates (see details below), and (iv) a 4-amino-acid extension at the C-terminal region of Pol in some sequences (21.4%).

Phylogenetic relationship of near-full-length HIV-1C. The molecular phylogeny of the 73 nonrecombinant HIV-1C sequences was analyzed by applying MP, NJ (or distance), and ML methods. Figure 1A represents an MP phylogenetic tree that was found to be the most parsimonious tree and had a minimal score of 29,382 when gaps were treated as missing. Figure 1B shows an ML tree with a score of -91931.28. The MP and ML scores are not comparable.
Branch lengths in the MP and ML analyses are not easily compared. This is because each method uses a large number of sites not used by the other method. MP uses sites that contain gaps; ML does not. This difference may be responsible for the relatively longer "backbone" branches of the MP tree (Fig. 1A) versus the ML tree (Fig. 1B). ML uses sites that are nearly constant, differing for only one sequence; MP does not. The longer terminal branches in the ML tree (Fig. 1B), compared to the MP tree (Fig. 1A), are due in part to this difference. These sites, which represent nucleotide changes unique to a single sequence in the analysis, necessarily increase the length of the terminal branch for that sequence.
Although there are numerous well-supported "lineages" (a term that is used hereafter for discussion purposes and does not imply taxonomic standing) of sequences within HIV-1C, the backbone of the HIV-1C lineage itself is not well supported; hence, the tree topology is unstable. This instability is evident in the very different positions of lineages between the MP and ML trees (Fig. 1) and in the prevalence of relatively short branch lengths along the backbones of both trees. The sequences used in this study are nearly full-length genomes: these genomes can yield little or no further sequence data. Hence, to obtain a more-stable, reliable topology it will be necessary to find new genomes having sequences that help to support these unstable portions of the tree and/or to improve methods of phylogenetic analysis to make better use of the available sequences. For example, current ML analyses discard sites that have gaps for any sequences included in the analysis. This is unfortunate because many of these sites represent "indels" (insertions and deletions) outside the hypervariable regions of the HIV-1 genome, and many indels in other organisms are known to contain significant phylogenetic information.
Because a detailed phylogenetic analysis of HIV-1C sequence evolution was outside the scope of this study, we addressed only the general shape of the ML tree based on near-full-length genome HIV-1C sequences, as well as congruency among the ML, MP, and NJ trees. The basal pair of sequences found in the MP tree (96BW17A09 and 00BW1471.27 [Fig. 1A]) agrees with results of an explicit search for the root of the HIV-1C lineage (49). This result is reflected in the ML tree (Fig. 1B), which is assumed to have the same root as the MP tree. Note that the ML tree includes only HIV-1C sequences and that, in phylogenetic terms, all trees produced in this study are unrooted.
A number of well-supported lineages (shaded in Fig. 1) were identified within the 73 near-full-length genome HIV-1C sequences. All these lineages occur in the best trees generated by all three phylogenetic methods (Fig. 1). Topology of the samples grouped within lineages was nearly identical in MP, NJ, and ML trees (sequence 00BW1811.3 was within the 00BW2128.3-96BW16.26-00BW1773.2-96BW0502 lineage in the ML but not the NJ and MP trees). Generally the bootstrap values were highly consistent by the MP and NJ analysis, although in one case (samples 96BW0407 and 98BWMO18.d5) a relatively high bootstrap value of 83 for a lineage in the NJ tree was not supported in the bootstrap analysis of the MP tree (value of 57 only). Overall, 57 of 73 (78.1%) HIV-1 sequences were found within lineages. Eight of the identified lineages were composed of two isolates, two lineages contained three sequences, three lineages embraced four isolates, one lineage included five, and two lineages consisted of eight sequences. Eight of nine samples from India and four of five samples from South Africa formed separate lineages, apparently demonstrating relevance to the phylogenetic founder effect. Sequences from Brazil, Ethiopia, and Israel were found in one lineage, which suggested a phylogenetic relatedness of these geographically distinct isolates. The Botswana sequences were scattered across the phylogenetic tree, with 39 viral isolates (76.5%) that formed 13 distinct lineages supported by high bootstrap values (Fig. 1). The Botswana lineages did not include any of the 22 non-Botswana sequences, and none of the Botswana sequences were part of any of three non-Botswana lineages. A number of sequences outside the lineages demonstrated similar topologies in MP and ML trees (00BW3886.8, 94IN476.104, and 00BW1783.5). 
In the MP analysis, neither the content of each lineage nor the overall topology of the best tree was changed when additional outgroup sequences were introduced to the alignment (HIV-1 subtype A, Q23-17, accession number AF004885; HIV-1 subtype B, HXB2CG, accession number K03455; data not shown).
Having addressed a phylogenetic relationship among consensus sequences, we compared the topology of the consensus sequence that represented the entire set of the 73 available HIV-1C sequences in the study to the topology of the consensus sequences that represented different subsets comprising 51 Botswana isolates, 22 non-Botswana sequences, 9 sequences from India, and 5 sequences from South Africa. The origin of the consensus sequences was reflected in their phylogenetic relationships (Fig. 1). While an observed closeness of the consensus sequences to the tree root in MP (Fig. 1A), NJ (data not shown), and ML (Fig. 1B) trees was not unusual, note that the consensus sequences were placed closer to the root of the tree than was any particular sequence from which they originated. As expected, the consensuses of the Indian and South African subsets clustered within the corresponding groups of sequences. Interestingly, a consensus sequence of the 22 non-Botswana isolate subsets, which included both subsets from India and South Africa, was closer to the root of the tree than was 5ZA_cons or 9IN_cons. An obvious nearness between consensuses representing 51 sequences from Botswana and the entire set of 73 HIV-1C sequences was not surprising, because Botswana sequences were dominant in the entire set. However, a relative closeness between consensuses of 51 sequences from Botswana and 22 non-Botswana sequences was striking. Assuming that sequence homology between the vaccine candidate and the infecting or challenging virus is essential, the overall topology of the consensus sequences and their phylogenetic relationship with corresponding sequences might suggest that candidate AIDS vaccines incorporating the consensus sequence have a greater potential than those incorporating any particular isolate sequence.
In terms of pairwise genetic distances between consensus sequences, the entire set of 73 HIV-1C was almost identical to the 51BW set (Fig. 2). This result is due to the predominance of Botswana sequences in the entire set. Nucleotide distance between 51BW and 22nonBW consensuses was only 0.48%. The consensus sequences for nine sequences from India and five sequences from South Africa differed from the 51BW consensus by 2.08 and 1.84%, respectively, while the distance between Indian and South African consensus sequences was 3.5%. Overall, pairwise genetic distances (Fig. 2) demonstrated a remarkable closeness between consensus sequences of different HIV-1C subsets.
HIV-1C Gag extended consensus. Figure 4 delineates the extended version of the consensus sequence for the HIV-1C Gag p17, p24, and p2/p7/p1/p6 based on the amino acid alignment of 73 subtype C sequences. The invariable amino acid residues within the HIV-1C Gag p17 were observed at 37 out of 129 positions (28.7%). In addition to the invariable amino acids, 57 positions (44.2%) across HIV-1C Gag p17 were relatively conserved by showing less than 10% diversity at a particular amino acid residue in the consensus sequence. The number of variable residues in the consensus that had a frequency of 90% or less was 35 (27.1%), which was at the level of invariable amino acids. The most variable amino acid residues that had frequencies of less than 50% in the consensus sequence were seen at four positions within p17. The alternative characteristics of amino acid residues (i.e., by charge) at positions 15 (K46, T30, A12), 90 (E36, A27, K21), 91 (G38, K29, N11, E10), and 119 (K45, E44) might suggest a substantial difference in the biological properties of different p17 proteins (subscripts indicate the percentage of a residue's frequency at that particular position in the alignment). Positions 15, 28, 62, 90, 91, 93, 111, 115, 118, and 119 might be considered the most variable within p17 by virtue of accommodating from seven to nine different amino acids at each position. A few sequences demonstrated indels in the C-terminal part of p17.
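The per-position notation used above (for example, K46, T30, A12) can be generated from an amino acid alignment with a short routine. This is a sketch of the idea only, not the Consensus program used in the study; the function names and the toy alignment are hypothetical.

```python
from collections import Counter


def extended_consensus(alignment, min_pct=0.0):
    """For each column, list (residue, percent) pairs, most frequent first."""
    positions = []
    for column in zip(*alignment):
        total = len(column)
        # Sort by descending count, then alphabetically to break ties.
        ranked = sorted(Counter(column).items(), key=lambda kv: (-kv[1], kv[0]))
        positions.append(
            [(res, 100.0 * cnt / total) for res, cnt in ranked
             if 100.0 * cnt / total >= min_pct]
        )
    return positions


def format_position(freqs):
    """Render one column as 'K75, T25' (residue followed by rounded percent)."""
    return ", ".join(f"{res}{pct:.0f}" for res, pct in freqs)


# Toy two-column alignment of four sequences (made-up data).
for index, freqs in enumerate(extended_consensus(["KA", "KE", "TE", "KE"]), start=1):
    print(index, format_position(freqs))
```

Setting `min_pct` would hide rare residues, at the cost of the minor-variant information the extended consensus is meant to retain.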
As shown in Fig. 4C, HIV-1C Gag p2/p7/p1/p6 comprised 45 invariable amino acid residues (34.2%), 56 relatively conserved residues with less than 10% diversity in the consensus (42.4%), and 31 variable residues with a frequency of 90% or less in the consensus sequence (23.5%). The frequency of amino acid residues at positions 372 and 373 was less than 50% in the consensus sequence: N49, S40, gap8, T1, Q1, G1 and T42, gap16, A15, I10, V5, P5, S3, M3, G1, respectively (where "gap" denotes a gap introduced to improve alignment). The asparagine and serine at position 451 were observed at frequencies of 50% and 47%, respectively. Positions 373, 389, and 478 were represented by eight or nine different amino acid residues each. Although insertions were not rare across p2/p7/p1/p6, most amino acid residues in the insertions were seen at low frequency. However, the frequency of some amino acid residues within the insertion between positions 455 and 456 reached the level of 30% (Fig. 4C).
Table 1 displays the amino acid distances across HIV-1C Gag, p17, p24, and p2/p7/p1/p6 for the entire set of 73 HIV-1C sequences (73C) and for the subsets. A high diversity between samples in the p17 (mean, 14.8%) and in the p2/p7/p1/p6 (mean, 12.7%), together with a low diversity within the p24 region (mean, 5.1%), resulted in the overall mean diversity of 9.5% across the entire HIV-1C Gag. The amino acid diversity among isolates from Botswana was slightly higher than the diversity among non-Botswana samples for the entire Gag and its subregions. Statistical significance of the differences depended on the presence of sequences from India in the group of non-Botswana samples, which changed from highly significant to nonsignificant if sequences from India were excluded. No significant differences were found between the subsets of Botswana and non-Botswana sequences for the amino acid distances to the consensus sequences (P > 0.10). However, the subset of sequences from Botswana demonstrated significantly higher amino acid diversity than the subset from India for the entire HIV-1C Gag and for any of the Gag regions in both between-sample (P < 0.0001) and to-consensus (P < 0.0001) comparisons. The subset of HIV-1C sequences from South Africa showed lower diversity within the p24 region than Indian sequences, although the difference was not statistically significant (P = 0.40).
Amino acid diversity within other HIV-1C proteins is shown in Table 2 together with P values and ratios for the comparison of between-sample and to-consensus distances. The between-sample distances ranged from 6.42% in Pol to 25.20% in Vpu, while to-consensus distances were in the range of 3.45% in Pol to 13.68% in Vpu. Importantly, to-consensus distances were significantly lower than between-sample distances for each HIV-1C protein. Amino acid diversity in Env demonstrated the highest ratio, 2.4, in the comparison of between-sample and to-consensus distances. Table 3 summarizes the diversity among lineages across HIV-1C proteins. In most cases amino acid distances within lineages are shorter than the average distances for the entire set of 73 HIV-1C sequences, although there are a number of examples of equal or even higher diversity within lineages.
Based on the set of 73 HIV-1C sequences, we estimated that the mean distance to the observed consensus sequence was 4.86%, with a 95% CI of 4.69 to 5.02%. The estimated 80th percentile was 5.45% (95% CI, 5.21 to 5.52%), and the 95th percentile was 5.80% (95% CI, 5.51 to 6.79%). Thus, on average HIV-1C sequences are within about 5% of the observed consensus, and with high confidence, 95% of viruses in the sampled population are within 6.79% of the observed consensus sequence.
The observed proportion of the 73 HIV-1C sequences within D% of the observed consensus, with D being equal to 4, 5, 6, and 7%, respectively, was 8.22% (95% CI, 1.92 to 14.52%), 54.79% (95% CI, 43.48 to 66.21%), 97.26% (95% CI, 93.52 to 100.00%), and 98.63% (95% CI, 95.96 to 100.00%). Thus, fewer than 15% of HIV-1C viruses are inferred to be within 4% of the observed consensus, between 43% and 66% of viruses are inferred to be within 5%, and at least 96% are inferred to be within 7% of the observed consensus.
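The intervals above follow from the normal approximation to the binomial, which can be checked with a few lines of arithmetic. In this sketch the function name `proportion_ci` is hypothetical, the counts 6 and 72 are inferred from the reported percentages (8.22% and 98.63% of 73), and truncating the interval to the range 0 to 100% is inferred from the reported upper limits of 100.00%.

```python
import math


def proportion_ci(successes, n, z=1.96):
    """Wald (normal-approximation) CI for a binomial proportion, clipped to [0, 1]."""
    p = successes / n
    half_width = z * math.sqrt(p * (1.0 - p) / n)
    return p, max(0.0, p - half_width), min(1.0, p + half_width)


# Counts inferred from the reported percentages of the 73 sequences.
for within in (6, 72):
    p, low, high = proportion_ci(within, 73)
    print(f"{100 * p:.2f}% (95% CI, {100 * low:.2f} to {100 * high:.2f}%)")
```

With few sequences near the extremes (6 of 73 or 72 of 73), the Wald interval is only a rough approximation, which is why clipping at the boundary is needed at all.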
We found that a single sequence does not appreciably alter the distribution of distances to the observed consensus. Based on the 73 HIV-1C viruses, we calculated that the mean of the 73 numbers $d_i = |\bar{x}_{[-i]} - \bar{x}|$ was 0.135%. The bootstrap 95% CI about the mean change in the mean distance to the observed consensus was 0.121 to 0.159%. The estimated mean change in the 80th percentile of the distance to the observed consensus was 0.126% (95% CI, 0.111 to 0.135%), and for the 95th percentile the estimated mean change was 0.117% (95% CI, 0.075 to 0.197%).
HIV-1C LTR promoter-enhancer region. Figure 5 depicts the extent of conformation to the GGGRNNYYCC consensus within NF-κB sites among the 73 HIV-1C sequences and 5 consensus sequences. The number of NF-κB sites within the promoter-enhancer region of HIV-1C varied from one (isolate 00BW1880.2) to three or more (isolates 96BW0502 and 96BWMO3.2), while the consensus sequences for the entire set or any subset of HIV-1C isolates demonstrated three NF-κB sites that conformed to the GGGRNNYYCC consensus. Within the subsets the HIV-1C sequences from Botswana demonstrated three NF-κB sites in 32 of 51 (62.7%) of cases, while only 10 of 22 (45.5%) non-Botswana sequences had the third NF-κB site (P > 0.10 [chi-square test]). The potential or prospective NF-κB sites represent a region that does not comply with the GGGRNNYYCC consensus but, in fact, is relatively close to it and might become a new NF-κB site due to a few point mutations. Viral isolates from Botswana that did not have three NF-κB sites demonstrated a potential/prospective NF-κB site more often than non-Botswana subtype C sequences (25.5 versus 4.5%, respectively; P < 0.001), which suggests that the promoter-enhancer region might be a hot zone within the evolving HIV-1C.
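Counting sites that conform to the GGGRNNYYCC consensus amounts to a degenerate-motif search, which can be sketched with a regular expression. This is an illustration, not the procedure used for Fig. 5: the function names are hypothetical, only the given strand is scanned (an assumption), and the test sequence is made up. The IUPAC codes are R = A or G, Y = C or T, and N = any base.

```python
import re

# IUPAC codes needed for the GGGRNNYYCC motif.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "N": "[ACGT]",
}


def motif_to_regex(motif):
    """Translate an IUPAC motif into a regular-expression fragment."""
    return "".join(IUPAC[symbol] for symbol in motif.upper())


def find_sites(sequence, motif="GGGRNNYYCC"):
    """Return 0-based start positions of all (possibly overlapping) matches."""
    # A lookahead makes each match zero-width, so overlapping sites are found.
    pattern = re.compile("(?=(" + motif_to_regex(motif) + "))")
    return [match.start() for match in pattern.finditer(sequence.upper())]


# Made-up fragment containing two conforming sites.
print(find_sites("AAGGGACTTTCCGCTGGGGACTTTCCAG"))
```

A "potential" site in the sense used above would need a tolerance for mismatches, which this exact-match sketch deliberately does not attempt because the text gives no threshold.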
| DISCUSSION |
|---|
From the point of view of vaccine design strategy, a preferred homology between the vaccine candidate and the circulating viruses might be a hard task to achieve, and a differential approach to an AIDS vaccine formulation might be required for different geographic areas based on particular molecular epidemiological data. For example, the predominance of HIV-1C in the southern African epidemic might be a strong argument for a subtype C specific vaccine design for the southern African countries.
Our results of the phylogenetic analysis of 73 near-full-length HIV-1C genomes, taken together with the presumed homology between the vaccine and the circulating or infecting virus, suggest that a consensus sequence approach to vaccine design could surmount the high viral diversity. Results obtained in the distance analysis on both the nucleotide (Fig. 3) and the translated amino acid (Tables 1 and 2) levels across the entire genome of HIV-1C and each viral protein convincingly justify the rationale for the consensus-based vaccine, although the concept might await evaluation in an efficacy trial.
The cumulative genetic information of a relatively large number of near-full-length genome sequences had sufficient power to segregate HIV-1 subtype C into multiple lineages. It is worth mentioning that, recently, lineages within HIV-1C have been found also among incomplete (i.e., env C2-V5 or C2-V3) sequences from India (47) and Ethiopia (1, 2). In this study, within the 73 near-full-length genome sequences of HIV-1C, numerous lineages were supported by (i) high bootstrap values by both MP and NJ methods, (ii) nearly identical content and topology of the lineages in MP, NJ, and ML analyses, and (iii) shorter pairwise distances between sequences within a lineage. In addition, certain of these lineages were further supported by (iv) accord with geographic area (Indian lineage, South African lineage), (v) consistent topology when different outgroups were used, and (vi) consistent topology when some sequences were excluded.
It is important to recognize that the appearance of lineages within HIV-1 subtype C might depend on the sampling. For example, without clone 94IN476.104, the two sequences 98BWMO37.d5 and 00BW3970.2 would have shared 208 (107 + 101) sites and could have received 100% rather than 90% bootstrap support. Some lineages had a short branch length at the base and might be unstable (e.g., collapse upon adding new isolates or disintegrate within shorter regions). Furthermore, based on data that were generated for a subset of sequences from Botswana (39), the lineages demonstrated no specific patterns related to the sequence segregation by viral load, CD4/CD8 counts, and/or HLA class I types. The number of NF-κB sites, the size of the insertion at the N terminus of the Vpu, or the extension at the C terminus of the Pol could not be assigned either independently or collectively for the lineages within HIV-1C.
New lineages might be identified in the future. For example, all but one Indian sequence and all but one sequence from South Africa form separate lineages. The out-of-main-group sequences from India (94IN476.104) and South Africa (ZA.CTSc2) proved that there were at least two distinct lineages of HIV-1C in India and in South Africa. In fact, some new HIV-1C sequences from India form additional lineages within the subtype C tree (49). Perhaps every sequence outside of an identified lineage might be seen as a potentially underrepresented lineage. Additionally, unidentified lineages (i.e., that cannot yet be distinguished due to a small sample) might fill their niche in the HIV-1C tree; also, new lineages might evolve within the HIV-1C.
An assumption about the beneficial effect of sequence homology between vaccine candidate and infecting virus raises a few issues related to the HIV-1C lineages. First, should a vaccine include different lineage sequences? If the epidemic is represented by multiple lineages, like the epidemic in Botswana, this may be worthwhile. Second, a representative consensus sequence as a vaccine for southern Africa and/or India might be a better choice irrespective of the number of lineages identified. Third, the notion of lineages within HIV-1C should be adjusted for the probability that available viral sequences adequately represent circulating viruses in a given geographic area. Fourth, a consensus approach to an AIDS vaccine design might overcome lineage clustering within HIV-1C. Finally, assuming that the available sequences accurately represent the epidemic, a superconsensus based on the consensus sequences of lineages instead of individual isolates might be a more appropriate candidate for the vaccine (B. Korber, personal communication).
The evaluation of the predictive power of the HIV-1C consensus sequence provided additional evidence in favor of a consensus-based vaccine. The average deviation of HIV-1C sequences from the consensus sequence was found to be 5%. Moreover, according to the prediction made based on the present data, a new sequence would not alter the distribution of distances to the consensus sequence. Taken together, the relationships of the sequences to the consensus, the high probability that a new sequence will be within the identified consensus sequence, and the consistency and stability of the consensus upon the introduction of a new isolate strongly suggest that a vaccine construct based on an HIV-1C consensus sequence could be superior to one based on any particular viral isolate.
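As a rough illustration of how a deviation from the consensus can be computed, the following Python sketch builds a majority-rule consensus from a toy alignment and reports the proportion of mismatched positions for each sequence. The toy sequences, the function names `consensus` and `distance_to`, and the use of a simple uncorrected proportion of differences are assumptions made for this example; they are not the data or the distance model used in this study.

```python
from collections import Counter


def consensus(aligned):
    """Majority-rule consensus of equal-length, pre-aligned sequences.

    Ties are broken by first occurrence in the column (an arbitrary
    choice for this sketch); gap characters are treated like residues.
    """
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*aligned))


def distance_to(ref, seq):
    """Uncorrected proportion of positions at which seq differs from ref."""
    if len(ref) != len(seq):
        raise ValueError("sequences must be aligned to the same length")
    return sum(a != b for a, b in zip(ref, seq)) / len(ref)


# Invented toy alignment, NOT data from the study.
toy = ["MGARASILRG", "MGARASVLRG", "MGARASILKG", "MGSRASILRG"]

cons = consensus(toy)
dists = [distance_to(cons, s) for s in toy]
mean_dev = sum(dists) / len(dists)

print(cons)
print([round(d, 3) for d in dists])
print(round(mean_dev, 3))
```

The same two functions can be used to probe stability: recomputing `consensus` after appending one more sequence to `toy` shows whether the consensus string changes.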
Although a consensus approach may surmount high viral diversity and the HIV-1C consensus sequence demonstrated high predictive power, the immunogenicity and protective efficacy of a consensus-based vaccine need to be assessed in further studies. As a concept, a consensus sequence approach to vaccine design and development may have certain limitations. Covariability that was described previously within the HIV-1 V3 loop (7, 27) implies that amino acid substitutions across HIV-1 proteins are not independent. Use of the consensus sequence for vaccine design might therefore generate artificial constructs that do not occur in wild-type virus. This problem could presumably be overcome by narrowing the set of residues and by preventing the inclusion of residue combinations that rarely occur in the population. The vaccine constructs could be adjusted for the optimal expression and proper folding of the consensus-based protein. If the vaccine incorporates multiple copies of numerous viral variants that correspond to the minor amino acids in the extended consensus (eCons), the size of such a hypothetical vaccine construct could be excessive, and it would be unrealistic to generate this kind of vaccine product. Perhaps, if some particular regions within HIV-1C proteins could be identified as promising candidates for the vaccine, a reasonable vaccine construct could be limited to these immunodominant and subdominant regions and could include multiple variants of protein regions to cover the variability of the virus, as well as potentially prevent viral escape from immune recognition. Further studies should address the feasibility of this approach in detail.
Additionally, if a vaccine efficacy trial could provide the data needed for selecting the weights optimally based on protective efficacy, the information on the relative structural and immunological importance of particular positions across the sequence could be incorporated straightforwardly by weighting the distance measures. Thus, the concept of the consensus-based vaccine needs to be tested with regard to the correlates of immune protection.
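One simple way to picture such weighting is a position-weighted proportion of differences, in which a mismatch at an important position counts for more than a mismatch elsewhere. The Python sketch below is a hypothetical illustration only: the function name `weighted_distance`, the toy sequences, and the weight values are invented, and no weights of this kind have been derived from trial data.

```python
def weighted_distance(ref, seq, weights):
    """Sum of weights at mismatched positions divided by the total weight.

    With equal weights this reduces to the plain proportion of differences.
    """
    if not (len(ref) == len(seq) == len(weights)):
        raise ValueError("ref, seq and weights must have the same length")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(w for a, b, w in zip(ref, seq, weights) if a != b) / total


# Invented toy data, NOT from the study.
ref = "MGARASILRG"
seq = "MGARASVLKG"                 # differs from ref at positions 6 and 8

uniform = [1.0] * len(ref)         # every position equally important
emphasised = list(uniform)
emphasised[6] = 3.0                # hypothetical: position 6 matters more

print(weighted_distance(ref, seq, uniform))
print(weighted_distance(ref, seq, emphasised))
```

Normalizing by the total weight keeps the result between 0 and 1, so weighted and unweighted distances remain comparable.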
The sequence summary of HIV-1C proteins was presented in the form of an eCons sequence by ranking amino acid frequencies across the proteins and by highlighting the prevailing amino acid residues at each position. The eCons sequence might be seen as a halfway point between the traditional alignment and the simple one-string consensus sequence. While alignment of sequences is a valuable tool for detailed phylogenetic analysis, the eCons might be a more convenient instrument for the analysis of a relatively sizable genomic region of a large sample set. The eCons sequence also overcomes the simplicity of the commonly used one-string consensus, which is good for relatively conserved proteins but is less informative for the variable regions. The eCons might be helpful for analyses in which the frequency of amino acids at a particular position is critical, e.g., characterization of immunodominant regions or epitopes. Moreover, the eCons can be used in vaccine design by incorporating multiple copies of viral variants into vaccine constructs, which might increase vaccine coverage for exposure to naturally occurring viruses. A vaccine based on an eCons sequence might theoretically shift the distribution of genetic distances of viruses in the population to the vaccine sequence toward zero and considerably reduce the average deviation relative to that of the ordinary consensus sequence. Available via the Internet (http://www.aids.harvard.edu/lab_research/concensus_sequence.htm), an eCons of the HIV-1C proteins might be a useful reference for vaccine formulation, as well as for the generation of synthetic peptides and other reagents to be used in assessing immune responses in relation to potential vaccine efficacy.
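The idea of ranking amino acid frequencies at each position can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the function name `extended_consensus`, the optional `min_freq` cutoff, the alphabetical tie-breaking, and the toy alignment are all invented for the example and do not reproduce the procedure or data behind the published eCons.

```python
from collections import Counter


def extended_consensus(aligned, min_freq=0.0):
    """Per-position list of (residue, frequency), most frequent first.

    Residues below min_freq are dropped; ties are ordered alphabetically
    (an arbitrary choice for this sketch).
    """
    n = len(aligned)
    profile = []
    for col in zip(*aligned):
        ranked = sorted(Counter(col).items(), key=lambda kv: (-kv[1], kv[0]))
        profile.append([(aa, c / n) for aa, c in ranked if c / n >= min_freq])
    return profile


# Invented toy alignment, NOT data from the study.
toy = ["MGARASILRG", "MGARASVLRG", "MGARASILKG", "MGSRASILRG"]

profile = extended_consensus(toy)
for pos, ranked in enumerate(profile, start=1):
    print(pos, " ".join(f"{aa}:{freq:.2f}" for aa, freq in ranked))

# The first-ranked residue at each position gives the one-string consensus.
one_string = "".join(ranked[0][0] for ranked in profile)
print(one_string)
```

Reading only the top-ranked residue at each position recovers the ordinary one-string consensus, whereas the lower-ranked entries retain the minor variants that the one-string form discards.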
In summary, we examined the molecular phylogeny patterns of the HIV-1C epidemic based on near-full-length genome sequences. Most of the analyzed sequences represented southern Africa, a region with the most severe HIV epidemic in the world. A number of lineages were identified within HIV-1C. A consensus approach to vaccine design is suggested to potentially overcome the high genetic diversity of HIV-1C. The extended version of the consensus sequence generated for all HIV-1C proteins might be a useful tool for vaccine design and for monitoring vaccine trials.
| ACKNOWLEDGMENTS |
|---|
This research was supported in part by grants AI47067, AI43255, and HD37793 from the National Institutes of Health and grant TW00004 from the Fogarty International Center, National Institutes of Health.
| REFERENCES |
|---|