Mia Coetzer,2 Angélique B. van 't Wout,1 Lynn Morris,2 and James I. Mullins1
Department of Microbiology, University of Washington, Seattle, Washington,1 AIDS Virus Research Unit, National Institute for Communicable Diseases, Johannesburg, South Africa2
Received 2 December 2005/Accepted 22 February 2006
| ABSTRACT |
|---|
| INTRODUCTION |
|---|
In many cases, however, it is not economically or practically possible to obtain biologically active virus samples to test in vitro, or the only available source of information is a sampling of viral population sequences. Although env (encoding the envelope) is not currently routinely sequenced (unlike pol [encoding the protease and reverse transcriptase], which is sequenced for drug resistance testing), it is likely that this will become more common with the imminent use of CCR5 antagonists in clinical trials, and the clinical interpretation of such sequences will become more important. Sequence-based methods for predicting coreceptor usage, if reliable, can provide a useful surrogate for biological phenotyping in these situations. Some genotype-based methods (such as the one presented here), in contrast to common phenotype-based assays, also score V3 sequences on a continuous scale. Changes in a continuous index have been correlated with shifts in phenotype and could be useful for predicting pathogenic changes within individuals that have not yet emerged biologically (17). This could become important for guiding therapeutic decisions and is a rationale for a larger role of env sequencing in the molecular surveillance of HIV-1.
Studies have shown that certain amino acid sites in the env gene, specifically in the third variable (V3) loop, are involved in coreceptor binding. This region plays an integral role in virus infectivity, and variations in the region have been correlated with changes in cell tropism, syncytium formation, and the progression of disease (6, 19, 21, 31). The V3 loop consists of approximately 35 amino acids with a conserved disulfide bridge at the base. Distinct genetic differences between CCR5- and CXCR4-using viruses have been described that influence coreceptor usage (10, 14, 30). These differences have been exploited in bioinformatic approaches to predict tropism, with various degrees of success (17). These approaches include noting the presence or absence of positively charged amino acids at V3 sites 11 and/or 25 (the "11/25 rule") to distinguish between SI and NSI viruses (14), a multiple regression method based on positive, negative, and net V3 charges (4), a neural network strategy (26), a machine-learning method (23), and a subtype B position-specific scoring matrix (B-PSSM) (16). The PSSM showed improved predictive power over the other methods (17) and was also useful in the analysis of the transition from R5 to X4 in subtype B viruses (16).
For this work, we tested four of these existing predictors of viral tropism on a subtype C data set of V3 sequences with known phenotypes to determine their applicability to subtype C sequences. We found an initially poor performance of these methods, which were developed based on knowledge obtained from subtype B viruses/sequences, in predicting SI virus CXCR4 usage, suggesting that a predictor based on subtype C sequences was necessary. Since the B-PSSM was shown to have an improved positive predictive value (17), we analyzed PSSM predictors constructed from V3 sequences of subtype C isolates of known phenotypes (C-PSSM). Predictions based on the C-PSSM exhibited increased reliability and sensitivity over subtype B-based predictors. We also found that the previously described B-PSSM (16) performed comparably to the C-PSSM when the predictor cutoff score was optimized for subtype C sequences. Further bioinformatic analysis indicated that V3 sites influencing coreceptor usage may differ between the two subtypes.
| MATERIALS AND METHODS |
|---|
Our training set constituted our best effort to obtain all available phenotyped subtype C sequences as of June 2004. To secure a validation set on which to test the C-PSSM predictor, we searched the Los Alamos database in January 2006 to obtain new phenotyped V3 sequences not contained in our training set. This search yielded 24 additional X4/SI sequences and 47 additional R5/NSI sequences. These sequences were all distinct. Several V3 sequences occurred multiple times in the database, with each instance associated with a different isolate. In these cases, if phenotypic information was available for more than one accession number, we required that the phenotype be consistent across reports for inclusion in the validation set. Alignments, accession numbers, and other details for all sequences in this study are available in the supplemental material and at http://mullinslab.microbiol.washington.edu/.
Most (>97%) phenotype determinations for the sequences in the training set were obtained by bulk phenotyping methods (see the supplemental material). Specifically, primary patient peripheral blood mononuclear cells or heterogeneous primary isolates were typically cocultured with HIV-negative peripheral blood mononuclear cells, and the resultant viruses were assayed for phenotype. This is not ideal, as sequences obtained from a heterogeneous isolate may not be representative of the phenotype manifested in the assay. For example, NSI viruses may predominate in a mixed isolate, but rarer SI viruses in the isolate may generate syncytia in an MT-2 cell culture. The isolate would thus be assigned an SI phenotype, but a sequence sampled from the isolate would likely belong to an NSI virus. Obtaining sequence and phenotype information from biological or molecular clones would obviate this problem. Unfortunately, a large data set of clonal isolates with associated sequences and phenotypes does not currently exist for subtype C.
More precisely, then, the test that we developed is a predictor of the phenotype of a bulk isolate based on the genotype of a single sequence sampled from that isolate. In one sense, this makes the test more practical, as most phenotyping studies involve bulk isolates. We have also observed for subtype B viruses (M. Curlin, J. I. Mullins, et al., unpublished data) that there is little difference in predictive power for PSSM predictors generated using assay data for either bulk or cloned isolates, provided that the size of the data set is large enough. Moreover, the success in prediction we obtained in this study also suggests that mismatches between sequence and bulk phenotypes are relatively uncommon, possibly due to selection at the level of coculture.
Genotypic algorithms. A subset of the data representing only the unique sequences within 220 subjects (constituting 200 NSI sequences and 23 SI sequences) was subjected to the following four prediction methods: the 11/25 rule (14), a multiple regression method referred to as the Briggs method (4), a machine-learning method referred to as the Pillai method (23), and the B-PSSM, as publicly implemented (16; http://mullinslab.microbiol.washington.edu/computing/pssm). The percentage of sequences with correctly predicted phenotypes was calculated for each algorithm.
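The simplest of these predictors, the 11/25 rule, can be sketched in a few lines. This is a minimal illustration, not the implementation used in the study. It assumes that each V3 sequence is aligned so that string index 10 corresponds to site 11 and index 24 to site 25, and that only lysine (K) and arginine (R) count as positively charged. The function names are hypothetical.

```python
# Residues treated as positively charged (assumption: K and R only).
POSITIVE = frozenset("KR")


def predict_11_25(v3: str) -> str:
    """Return 'SI' if V3 site 11 or 25 (1-based) is positively charged, else 'NSI'.

    Assumes the sequence is aligned so that string index 10 is site 11
    and string index 24 is site 25 (no insertions or deletions upstream).
    """
    if len(v3) < 25:
        raise ValueError("sequence must cover V3 sites 1-25")
    return "SI" if v3[10] in POSITIVE or v3[24] in POSITIVE else "NSI"


def percent_correct(records, phenotype: str) -> float:
    """Percentage of sequences with the given known phenotype predicted correctly.

    records: list of (v3_sequence, known_phenotype) pairs.
    """
    subset = [seq for seq, known in records if known == phenotype]
    if not subset:
        raise ValueError(f"no sequences with phenotype {phenotype!r}")
    hits = sum(predict_11_25(seq) == phenotype for seq in subset)
    return 100.0 * hits / len(subset)
```

Calling `percent_correct(records, "SI")` gives the fraction of SI sequences that the rule recovers, and `percent_correct(records, "NSI")` gives the corresponding figure for NSI sequences.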
Development of C-based scoring matrices. We derived a predictor from PSSM, calculated as described previously (16), based on the subtype C training set of 279 V3 sequences. Distributions of specificities (fractions of correctly predicted NSI sequences) and sensitivities (fractions of correctly predicted SI sequences) were estimated by combining data set bootstrapping with leave-one-out cross-validation. In this procedure, the target sequence and all sequences of that phenotype from the same infected individual were removed from the data set, and single sequences of each phenotype from the remaining individuals were randomly sampled with replacement. The random sample was used to calculate a PSSM predictor, and the PSSM was used in turn to predict the phenotype of the target. The prediction was made by comparing the score of the target sequence to a cutoff score, as follows: an SI prediction was called if the target score was greater than the cutoff, and an NSI prediction was called if the score was less than the cutoff. Cutoffs were calculated as scores which maximized the product of the sensitivity and specificity of predictions made by applying the PSSM to the sequences used to produce the PSSM (16). Resampling was repeated 100 times to obtain an empirical prediction probability for the target sequence. Each sequence in the data set was treated as a target in turn. The Bernoulli random variables represented by the prediction probabilities for each sequence were then sampled to obtain a set of simulated predictions, and sensitivity and specificity were calculated using the simulated predictions. Bernoulli sampling was repeated 100 times to obtain the reported empirical distributions for sensitivity and specificity. All analyses were performed using scripts written in PERL and R (25; http://cran.r-project.org). Scripts are available upon request.
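The scoring and cutoff steps described above can be sketched as follows. The exact matrix calculation is given in reference 16 and is not reproduced here, so this sketch assumes a simple log-odds matrix (SI versus NSI residue frequencies at each site) with an additive pseudocount. The pseudocount, the alphabet, and all function names are assumptions made for illustration. The cutoff search follows the text: it picks the score that maximizes the product of sensitivity and specificity on the sequences used to build the matrix.

```python
import math
from collections import Counter

# Assumed alphabet: 20 amino acids plus an alignment gap.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY-"


def site_frequencies(seqs, pseudocount=1.0):
    """Per-site residue frequencies with an additive pseudocount (assumed smoothing)."""
    total = len(seqs) + pseudocount * len(AMINO_ACIDS)
    freqs = []
    for i in range(len(seqs[0])):
        counts = Counter(s[i] for s in seqs)
        freqs.append({a: (counts[a] + pseudocount) / total for a in AMINO_ACIDS})
    return freqs


def build_pssm(si_seqs, nsi_seqs):
    """Log-odds of SI versus NSI residue frequencies at each aligned site."""
    si, nsi = site_frequencies(si_seqs), site_frequencies(nsi_seqs)
    return [{a: math.log(fs[a] / fn[a]) for a in AMINO_ACIDS}
            for fs, fn in zip(si, nsi)]


def score(pssm, seq):
    """Sum of per-site log-odds; residues outside the alphabet contribute 0."""
    return sum(col.get(a, 0.0) for col, a in zip(pssm, seq))


def best_cutoff(si_scores, nsi_scores):
    """Cutoff maximizing sensitivity * specificity, searched over score midpoints."""
    values = sorted(set(si_scores) | set(nsi_scores))
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])] or values

    def product(cutoff):
        sensitivity = sum(s > cutoff for s in si_scores) / len(si_scores)
        specificity = sum(s < cutoff for s in nsi_scores) / len(nsi_scores)
        return sensitivity * specificity

    return max(candidates, key=product)
```

A target sequence is then called SI if `score(pssm, target)` exceeds the cutoff and NSI otherwise. In the bootstrap procedure, `build_pssm` and `best_cutoff` would be rerun on each resampled data set with the target individual's sequences held out.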
For between-subtype comparisons, we performed this analysis on an HIV-1 subtype B data set consisting of 187 NSI and 70 SI sequences from 107 infected subjects (26).
ROC analysis. We used receiver operating characteristic curves (ROCs) (1) to compare the C-PSSM and B-PSSM on the basis of their overall ability to discriminate between SI and NSI sequences. We used a C-PSSM based on the entire data set and a B-PSSM based on SI/NSI sequences as described previously (16). For a particular set of target sequences, each target was scored using one of the matrices. For any cutoff score, false-positive results were those NSI sequences whose scores lay above the cutoff, and true-positive results were SI sequences with scores above the cutoff. The ROC was generated by plotting the pair (false-positive fraction of NSI sequences, true-positive fraction of SI sequences), i.e., (1 − specificity, sensitivity), for a set of 100 cutoff scores evenly spanning the range of PSSM scores. To get a sense of the variation in ROC over the set of infected individuals represented in our data, we calculated an ROC for each of 100 sampled data sets composed of one randomly selected sequence per patient per phenotype.
We analyzed the performances of both matrices on both subtype C and B V3 sequences and plotted the distributions of the areas under the ROC for each of the four cases. The area under the curve is a measure of the test's ability to correctly predict the phenotype of any sequence: an area of 0.5 indicates that the test is not better than a random guess, while a perfect test (for those sequences analyzed) has an area of 1. We calculated the area using Euler's method of integration.
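The ROC construction and area calculation can be illustrated with a short sketch. It follows the description above (100 evenly spaced cutoffs spanning the score range), but it integrates with the trapezoidal rule rather than Euler's method, and it adds the end points (0, 0) and (1, 1) to close the curve. Both choices, along with the function names, are assumptions made for this illustration.

```python
def roc_points(si_scores, nsi_scores, n_cutoffs=100):
    """(1 - specificity, sensitivity) pairs for evenly spaced cutoff scores."""
    lo = min(min(si_scores), min(nsi_scores))
    hi = max(max(si_scores), max(nsi_scores))
    step = (hi - lo) / (n_cutoffs - 1)
    points = []
    for k in range(n_cutoffs):
        cutoff = lo + k * step
        # Fraction of NSI sequences scoring above the cutoff (false positives).
        fpf = sum(s > cutoff for s in nsi_scores) / len(nsi_scores)
        # Fraction of SI sequences scoring above the cutoff (true positives).
        tpf = sum(s > cutoff for s in si_scores) / len(si_scores)
        points.append((fpf, tpf))
    return points


def roc_area(points):
    """Trapezoidal area under the ROC, with (0, 0) and (1, 1) added as end points."""
    pts = sorted(set(points) | {(0.0, 0.0), (1.0, 1.0)})
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(pts, pts[1:]))
```

With perfectly separated scores the area is 1, and with identical SI and NSI score distributions it is 0.5, matching the interpretation given in the text.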
Overlap coefficient analysis. To investigate the potential differences between subtypes B and C, we examined the overlap coefficients (OCs) (13) between SI and NSI amino acid profiles. The training sets described the amino acid frequency distribution for each site. The OC, in this context, is a site-wise measure of the difference between the SI and NSI amino acid distributions for V3 sequences (equations for the OC are given in reference 13). If the OC is 0, the distributions are identical, and if the OC is 1, there is no amino acid overlap between the distributions (i.e., the SI amino acids at a site are completely distinct from NSI amino acids at that site). Thus, the OC is a measure of the ability of a V3 site to discriminate between the two phenotypes based on the training set.
To determine whether an OC was significantly high, we compared it to a distribution of OCs generated by randomly assigning training set sequences to the SI or NSI category. OCs were calculated for 250 random permutations, and the P value of the training set OC was reported as 1 minus its percentile within the random OC distribution.
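A sketch of the permutation test follows. The OC equations are given in reference 13 and are not reproduced in this text, so the sketch substitutes a stand-in with the same end points: one minus the summed per-residue minimum of the two frequency distributions, which is 0 for identical distributions and 1 for disjoint ones. That formula, the function names, and the fixed random seed are assumptions. The P value is computed as the fraction of permuted OCs at or above the observed OC, which approximates 1 minus the percentile.

```python
import random
from collections import Counter


def overlap_coefficient(si_column, nsi_column):
    """Assumed stand-in for the OC: 0 if distributions are identical, 1 if disjoint."""
    p, q = Counter(si_column), Counter(nsi_column)
    shared = sum(min(p[a] / len(si_column), q[a] / len(nsi_column))
                 for a in set(p) | set(q))
    return 1.0 - shared


def permutation_p_value(si_column, nsi_column, n_perm=250, seed=0):
    """Fraction of label-shuffled OCs that reach or exceed the observed OC."""
    rng = random.Random(seed)
    observed = overlap_coefficient(si_column, nsi_column)
    pooled = list(si_column) + list(nsi_column)
    n_si = len(si_column)
    at_or_above = 0
    for _ in range(n_perm):
        # Randomly reassign residues at this site to the SI or NSI category.
        rng.shuffle(pooled)
        if overlap_coefficient(pooled[:n_si], pooled[n_si:]) >= observed:
            at_or_above += 1
    return at_or_above / n_perm
```

Here `si_column` and `nsi_column` hold the residues observed at one V3 site in the SI and NSI training sequences, so the test is run once per site.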
HTA. The V3-based heteroduplex tracking assay (V3-HTA) has been used as a rapid genotype-based method to identify genetic variation associated with NSI- and SI-like viruses in subtypes B and C (22, 24). The mobility of the heteroduplex reflects the differences between the probe and the sample sequence. This is measured by the mobility ratio k, which is the distance traveled by the heteroduplex divided by the distance traveled by the homoduplex. The greater the genetic difference between the probe and the sample, the slower the migration of the heteroduplex and the smaller the mobility ratio. V3-HTA mobility ratios were available for 13 NSI and 8 SI subtype C viral isolates from South Africa using an NSI probe (9). The C-PSSM score of each of these isolates was calculated and correlated to the V3-HTA mobility ratio to determine the relationship between the genotypic algorithm and a genotype-based molecular assay.
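The mobility ratio and its comparison with PSSM scores can be sketched as below. The text does not state which correlation statistic was used, so the Pearson coefficient here is an assumption, as are the function names.

```python
import math


def mobility_ratio(heteroduplex_distance, homoduplex_distance):
    """Distance traveled by the heteroduplex divided by that of the homoduplex."""
    if homoduplex_distance <= 0:
        raise ValueError("homoduplex distance must be positive")
    return heteroduplex_distance / homoduplex_distance


def pearson_r(xs, ys):
    """Pearson correlation coefficient (assumed statistic; not named in the text)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)
```

Because an NSI probe was used, sequences that diverge more from the probe migrate more slowly, so a negative correlation between PSSM score and mobility ratio would be the expected direction if the two measures agree.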
C-PSSM for public use. We have made a C-PSSM predictor available online at http://mullinslab.microbiol.washington.edu/computing/pssm/, based on the computational techniques presented in this paper. The details of this implementation, including improved handling of data set sampling issues, are beyond the scope of this report and will be described separately (M. A. Jensen, unpublished data). Briefly, all sequences in the data set were used to generate the matrix, but the contribution of each infected individual's sequence was weighted by the reciprocal of the number of sequences sampled from that individual. We refer to this as the "sample-averaged" C-PSSM.
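The weighting scheme can be illustrated with a small sketch. It computes per-site residue frequencies in which each sequence is weighted by the reciprocal of the number of sequences from the same individual, so that every individual contributes a total weight of 1. The record layout and function name are hypothetical, and the details of the public implementation may differ.

```python
from collections import Counter, defaultdict


def sample_averaged_frequencies(records):
    """Per-site residue frequencies with each subject contributing total weight 1.

    records: list of (subject_id, aligned_sequence) pairs of equal sequence length.
    """
    per_subject = Counter(subject for subject, _ in records)
    length = len(records[0][1])
    weighted = [defaultdict(float) for _ in range(length)]
    for subject, seq in records:
        # Reciprocal of the number of sequences sampled from this individual.
        weight = 1.0 / per_subject[subject]
        for i, residue in enumerate(seq):
            weighted[i][residue] += weight
    n_subjects = len(per_subject)
    return [{residue: w / n_subjects for residue, w in column.items()}
            for column in weighted]
```

With this weighting, an individual sampled four times influences the matrix no more than an individual sampled once, which limits the effect of uneven sampling depth.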
| RESULTS |
|---|
The ROCs in Fig. 4A show that the distributions of sensitivity for a given specificity overlap for the C-PSSM and B-PSSM tests on subtype C sequences. Thus, neither test has a clear advantage, although the B-PSSM has a better median sensitivity for low levels of false-positive results. The variances of the ROC area distributions (Fig. 4B) in these two cases differed significantly (Levene-Carroll-Schneider test [1]; F = 6.74; P = 0.01). Since the B-PSSM test response was more variable, this suggests that tests on subtype C sequences using a B-PSSM will be more dependent on the sequences being analyzed. That is, with a B-PSSM, some subtype C virus-infected patients may be predicted very well and others rather poorly; using a C-PSSM may reduce this effect.
Finally, the ROCs can be used to compare the C-PSSM to the publicly implemented methods shown in Table 1. Each of the methods examined (excluding the Briggs method) gave an almost perfect specificity; the sensitivities at that level ranged between 47.8% and 52.0%. Inspection of the C-PSSM ROC at the 2.5th percentile shows that a sensitivity of approximately 75% can be attained for a specificity of 100%. The C-PSSM thus represents a significant performance improvement over these methods in this respect.
To determine whether the sensitivity and specificity might be significantly improved by increasing the number of training sequences, we performed leave-one-out/bootstrap analysis on subsets of the data set. The total size of the subsets was increased incrementally, and unique SI and NSI sequences were randomly selected in a ratio of 1:10, comparable to the ratio in the total data set (Fig. 5). The specificity of the C-PSSM for predicting NSI phenotypes was high at even the smallest total sample size, and it declined slightly when the sample number was increased (P = 0.001 for a Kruskal-Wallis test between the smallest and largest sample sizes). The sensitivity of the C-PSSM for predicting SI phenotypes with small sample numbers was poor but appeared to approach a limit as the sample size increased to approximately 100 (Kruskal-Wallis test between sample size pairs; 5/50 versus 10/100, P < 10⁻¹⁵; 10/100 versus 15/150, P = 6 × 10⁻⁶; 15/150 versus 20/207, P = 0.025).
The fact that the ROC areas for both subtypes fall below the estimated confidence intervals (Fig. 4) indicates that a prediction bias persists in the cross-validation analysis, despite the use of the leave-one-out technique. This bias should not affect the relative comparisons between subtypes that we have made above, but the estimated values for optimal specificity and sensitivity should be considered upper bounds of the actual values.
Site-wise differences in phenotype between subtypes B and C. Because SI viruses are reported to be much less prevalent in subtype C than in subtype B populations, it is possible that different, less evolutionarily labile sites influence the manifestation of phenotype for subtype C viruses. The availability of both subtype B and subtype C training sets afforded us a chance to investigate potential differences using OCs. Figure 6 displays the P value for the OC for each V3 site, comparing subtype B (using the SI/NSI data set of Resch et al. [26]) and subtype C sequences (using our data set). Sites with OCs that exceed the 95th percentile of the random permutation distribution (depicted in Fig. 6 as bars that extend beyond the dotted lines) have amino acid distributions that are significantly different between SI and NSI viruses. Sites with nonsignificant OCs are less informative for purposes of discriminating between SI and NSI viruses by genotype. Under this interpretation, the OC analysis highlights sites that are potentially different in their influence on phenotype between the two subtypes. In particular, V3 sites 12, 15, 16, 26, 27, 28, 33, and 34 (shown in gray in Fig. 6) have significant OC values in one but not the other subtype.
| DISCUSSION |
|---|
Sequence logos highlighted appreciable differences between V3 sequences of NSI and SI subtype C viruses. The NSI data set was very homogeneous, with little or no variation at many of the amino acid sites, while in the SI data set there was greater variation at most sites. These data suggested that sufficient genetic variation between NSI and SI subtype C sequences exists to allow the sequences to be used to differentiate these phenotypes. However, none of the available prediction methods were able to adequately exploit these differences in differentiating NSI from SI viruses. The 11/25 rule is based on the presence of positively charged amino acids at positions 11 and/or 25 (14, 20). However, >50% of the subtype C SI sequences in this study did not have a positively charged amino acid at either of these positions (Table 1, percent correctly predicted for the 11/25 method). Furthermore, while the 11/25 method is considered a reliable sequence-based phenotype predictor, other studies have shown that more than two amino acid positions need to be considered when assigning phenotype (26). The Briggs method performed the least well of all the algorithms evaluated. This method is based on genotype variables in the V3 region derived from subtype B sequences, with NSI viruses having a net charge of <4 and SI viruses having a net charge of >4. The net charges of the subtype C NSI data set ranged from 0 to 6, and the SI data set had net charges from 3 to 9, which probably explains the poor performance of this algorithm. The Pillai method is a two-way classification method that differentiates between viruses able or unable to use CXCR4. The limitations of this method include the fact that it misclassifies R5X4 viruses (23). These dual-tropic viruses may represent an intermediate stage of coreceptor evolution in subtype B viruses (16, 33, 34) and are usually grouped into the SI data set, as was done in this study. 
Our SI data set contained nine dual-tropic viruses, six of which were incorrectly predicted as NSI using the Pillai method.
While most methods could accurately identify NSI viruses, it was clear that a new method was needed to improve the sensitivity of prediction of SI viruses from subtype C sequences. The subtype C-specific PSSM addressed some of the limitations of other prediction methods, including the B-PSSM as publicly implemented (16). In particular, the C-PSSM identified SI viruses more reliably than the other methods, resulting in a significant increase in sensitivity when low levels of false-positive results are required. We also found that the B-PSSM performed comparably to the C-PSSM when the cutoff score was optimized for subtype C sequences. In this case, however, the variability of performance, as measured by ROC areas, was greater for the B-PSSM.
It was previously suggested (16) that the PSSM score represents the "X4 potential" of a sequence, in that intermediate scores track the temporal evolution of viruses within an individual. This method has also contributed to a better understanding of the role of intermediates (R5X4) in the transition from R5 to X4 in subtype B viruses (16) and can now be applied to subtype C viruses, where this transition has not often been reported. For this application, and to improve the prediction quality of the C-PSSM, vigorous sampling of more patients will be required, in at least the present ratio of SI to NSI sequences. Therefore, future sampling should focus on the acquisition of more X4 sequences from new individuals.
Potential site differences between subtypes B and C within the V3 loop were investigated by overlap coefficient analysis using current training sets. This suggested that changes in the crown at site 15 will influence coreceptor usage in subtype B viruses but that changes at site 16 will have more influence in subtype C viruses. Other sites that had significant OC values in one but not the other subtype were sites 12, 26, 27, 28, 33, and 34. These are unlikely to be simple artifacts of sampling, since the majority of sites are congruent (either both significant or both nonsignificant) between the subtypes and since both possibilities (significant for subtype B and nonsignificant for subtype C and vice versa) are represented at the incongruent sites. However, we are not claiming that incongruent sites constitute a rejection of any explicit null model. Rather, this simple analysis suggests the possibility of differential phenotypic effects of mutations at certain sites that could be evaluated in future studies.
The C-PSSM represents an improvement over currently available methods for predicting SI viruses in subtype C populations. This could lead to improved detection of SI viruses and to insights into the pathogenic role of coreceptor usage in this subtype. With the increased availability of small-molecule fusion inhibitors and the use of antiretroviral therapy in patients infected with subtype C viruses, there is concern that certain therapies may increase the risk of developing X4 viruses during subtype C infections. PSSM scores may be a useful tool for assessing baseline risk for this possibility (17) and may have prognostic value for treatment outcomes in general (5). Although the number of HIV-1 subtype C SI viruses available is a limiting factor, this study has shown that currently available data provide a good initial basis for a subtype C coreceptor usage predictor.
| ACKNOWLEDGMENTS |
|---|
This work was funded by grants from the South African AIDS Vaccine Initiative (SAAVI), The Wellcome Trust, and the Poliomyelitis Research Foundation and by grants to J.I.M. and A.B.W. from the U.S. Public Health Service, including grants to the University of Washington Center for AIDS Research. L.M. is a Wellcome Trust International Senior Research Fellow in Biomedical Science in South Africa. M.C. received travel support from the Fogarty Training Fellowship (TWO-0231). M.A.J. was supported in part by NIH award GM33782 to Bruce R. Levin.
| FOOTNOTES |
|---|
Supplemental material for this article may be found at http://jvi.asm.org/.
M.A.J. and M.C. contributed equally to this work.
Present address: Department of Clinical Viro-Immunology, Sanquin Research, Amsterdam, The Netherlands.
| REFERENCES |
|---|