Journal of Virology, March 2004, p. 2242-2246, Vol. 78, No. 5
0022-538X/04/$08.00+0 DOI: 10.1128/JVI.78.5.2242-2246.2004
Copyright © 2004, American Society for Microbiology. All Rights Reserved.
University of Edinburgh, Edinburgh, Scotland,1 University of California, San Diego,2 VA San Diego Healthcare System, La Jolla,11 Harbor-UCLA, Los Angeles,3 ViroLogic Inc., South San Francisco, California,10 Aaron Diamond AIDS Research Center, New York, New York,4 University of Washington, Seattle, Washington,5 University of Colorado Health Sciences Center, Denver, Colorado,6 University of British Columbia, Vancouver,7 McGill University Health Centre, Montreal, Canada,9 Johns Hopkins University, Baltimore, Maryland8
Received 27 May 2003/ Accepted 17 October 2003
≤0.4-fold that of HIV type 1NL4-3 (HIV-1NL4-3) to ritonavir (hypersusceptibility [HS]). There is also substantial variation in replicative capacity (RC), an in vitro assay of the contributions of protease (PR) and reverse transcriptase to viral fitness. In chronically infected antiretrovirally treated patients, amprenavir HS has been associated with the mutation N88S in PR, but this mutation is not seen in untreated patients. In this study, virus strains from 182 cases of primary HIV infection were analyzed, and a highly significant association between HS and low RC (≤10% that of HIV-1NL4-3) was observed (P < 10⁻⁶). Multivariate analysis was used to determine the genotypic basis of ritonavir HS, analyzing all polymorphic amino acid sites and insertions from p7gag through PR. Decision tree models developed on the entire Gag-plus-PR data set and on PR alone gave overall correct classifications of 73 and 72%, respectively, on cross-validation. They were also able to predict low RC, with sensitivities of 69 and 62% and specificities of 84 and 70%, respectively. The analysis shows that ritonavir HS in untreated primary HIV infection is not associated with single mutations but with combinations of amino acids at polymorphic sites and that the same genotypes which confer HS to PR inhibitors confer low RC. This supports the view that variation in PR function is directly responsible for variation in fitness among strains in primary infection.
Hypersusceptibility (HS) to ARV drugs, defined here as susceptibility ≤0.4-fold that of HIV-1NL4-3, is another phenotype that was first described in the context of patients who had been treated with a failing ARV regimen; HS to nonnucleoside reverse transcriptase inhibitors has recently been shown to be clinically significant (4). In one study, among those who had acquired a virus strain resistant to nelfinavir, >6% were found to be HS to amprenavir (18). In this situation, amprenavir HS was shown to be specifically associated with mutations at amino acid 88 in PR, particularly N88S (18). However, HS has also been described in a subset of individuals who have never received therapy (Wrin et al., Abstr. 5th Int. Workshop on HIV Drug Resistance), in whom this mutation is absent. We show here that, among patients with primary HIV infection, low RC and HS to PIs are directly related, and we obtain a single decision tree model for the genetic bases of both.
Data.
All amino acid sites where the most common amino acid was present at a frequency <98% were included in the analysis. Mutations at each site were analyzed using single-letter codes.
Techniques. A variety of analytical methods were investigated, including stepwise logistic regression (SPSS version 10.1), CART (S-Plus version 6.0), and the related informatics-based methods of decision trees, PART rules, and support vector machines implemented in the Weka package (17). CART trees and C4.5 decision trees produced similar results; the results from the C4.5 decision trees and support vector machines are given here.
The C4.5 decision trees were generated using a cost-sensitive classifier. A range of cost values and tip (leaf) sizes were explored for both the ritonavir and amprenavir analyses, and models were tested by 90-10 cross-validation. In this process, a model is repeatedly generated based on a 90% sample of cases chosen at random, and its prediction is tested on the remaining 10%. The tests were run 10 times on independent samples of the data to give the quoted sensitivities and specificities.
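The 90-10 procedure described above can be sketched in a few lines. This is a minimal illustration in Python, not the Weka workflow used in the study: the function names (`sens_spec`, `repeated_holdout`, `majority_fit`) and the toy data are hypothetical, and the placeholder majority-class learner merely stands in for a C4.5 tree.

```python
import random

def sens_spec(truth, pred, positive="HS"):
    """Return (sensitivity, specificity) for paired lists of labels."""
    tp = sum(t == positive and p == positive for t, p in zip(truth, pred))
    fn = sum(t == positive and p != positive for t, p in zip(truth, pred))
    tn = sum(t != positive and p != positive for t, p in zip(truth, pred))
    fp = sum(t != positive and p == positive for t, p in zip(truth, pred))
    sens = tp / (tp + fn) if tp + fn else float("nan")
    spec = tn / (tn + fp) if tn + fp else float("nan")
    return sens, spec

def repeated_holdout(cases, labels, fit, runs=10, train_frac=0.9, seed=0):
    """Fit on a random 90% sample and test on the remaining 10%, `runs` times.
    Test-set predictions from all runs are pooled before computing the rates."""
    rng = random.Random(seed)
    n_train = int(round(train_frac * len(cases)))
    truth, pred = [], []
    for _ in range(runs):
        idx = list(range(len(cases)))
        rng.shuffle(idx)
        train, test = idx[:n_train], idx[n_train:]
        model = fit([cases[i] for i in train], [labels[i] for i in train])
        for i in test:
            truth.append(labels[i])
            pred.append(model(cases[i]))
    return sens_spec(truth, pred)

def majority_fit(train_cases, train_labels):
    """Placeholder learner (hypothetical): always predicts the commonest label."""
    top = max(set(train_labels), key=train_labels.count)
    return lambda case: top

# Toy data (hypothetical): 20 HS and 80 wild-type cases.
cases = [{"57": "R"}] * 20 + [{"57": "K"}] * 80
labels = ["HS"] * 20 + ["WT"] * 80
sens, spec = repeated_holdout(cases, labels, majority_fit)
```

Pooling the held-out predictions across runs is one reasonable way to arrive at a single sensitivity and specificity; averaging the per-run rates would be an equally plausible reading of the text.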
Nucleotide sequence accession numbers. The nucleotide sequences have been deposited in GenBank under accession numbers AY518941 to AY519122.
FIG. 1. Distribution of ritonavir susceptibilities in 182 primary-infection patients. Cases of transmitted PI-resistant virus (7) (>10-fold less than the susceptibility of HIV-1NL4-3) were not included in the analysis.
FIG. 2. Relationship between RC and ritonavir susceptibility. Both variables are plotted on log10 scales. The continuous curved line represents the fitted quadratic function, which reaches a maximum at an RC value of 56% (dashed horizontal line). Points to the left of the vertical line are cases classified as HS to ritonavir (susceptibility, ≤0.4-fold that of HIV-1NL4-3). The points below the solid horizontal line are cases classified as low RC (≤0.1-fold the RC of HIV-1NL4-3).
≤0.1-fold the RC of HIV-1NL4-3), 9 were also classified as HS to ritonavir (exact P, <10⁻⁶).
Polymorphism in gag-PR.
The fact that there is a high level of polymorphism in PR in untreated patients has been known for some time (1, 5). Defining a polymorphic site with the criterion that the most common amino acid has a frequency <98%, 20 of the 99 amino acid sites in PR were polymorphic in this data set. In addition, of the 70 amino acid sites available for analysis from p7 and p6, 55 (79%) were polymorphic by the same criterion, and the region included four polymorphic insertions at gag454, gag460, gag478, and gag483 (numbering according to HIV-1LAI, clone HXB2R). This high level of variation is all the more remarkable because in the p6 region the pol reading frame (-1 with respect to that of gag) overlaps with that of the gag polyprotein for 56 amino acids.
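The polymorphism criterion amounts to a per-column frequency count over aligned sequences. The sketch below is a hypothetical illustration, not the authors' code: the function name `polymorphic_sites`, the strict "below 98%" cutoff, and the toy four-residue sequences are all assumptions.

```python
from collections import Counter

def polymorphic_sites(sequences, threshold=0.98):
    """Return 0-based column indices where the most common residue
    has a frequency below `threshold` (assumed to be a strict cutoff)."""
    if not sequences:
        return []
    length = len(sequences[0])
    if any(len(s) != length for s in sequences):
        raise ValueError("sequences must be aligned to the same length")
    sites = []
    for col in range(length):
        counts = Counter(seq[col] for seq in sequences)
        top_freq = counts.most_common(1)[0][1] / len(sequences)
        if top_freq < threshold:
            sites.append(col)
    return sites

# Toy alignment (hypothetical): the third column varies in 5 of 100 sequences.
toy = ["PQIT"] * 95 + ["PQVT"] * 5
variable_columns = polymorphic_sites(toy)
```

Because a gap character is counted like any other residue, the same function would also flag columns holding a polymorphic insertion, provided the alignment represents absent residues explicitly.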
Genetic basis of HS. Initial studies were carried out using logistic regression with a modified data set of PR sites alone in which sites with more than one mutant amino acid were represented by a series of "dummy" binary variables, each corresponding to a single allele. A highly significant (P < 10⁻⁷) model incorporating five amino acid sites was identified (amino acids 12, 33, 37, 45, and 63). At amino acid 37, two amino acids (E and Y), and at amino acid 63, three amino acids (L, V, and Q) were independently associated with HS. However, this model had a sensitivity (rate of correct prediction of HS) of only 37.5%.
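The "dummy" coding used for the logistic regression can be illustrated as follows. This is a sketch under stated assumptions: the function name `dummy_code`, the choice of reference amino acids, and the three toy genotypes are hypothetical and are not taken from the study data.

```python
def dummy_code(genotypes, reference):
    """Expand categorical site data into binary dummy variables,
    one column per non-reference amino acid observed at each site.

    genotypes: list of dicts mapping site number -> amino acid
    reference: dict mapping site number -> reference amino acid (no column)
    Returns (column_names, rows), where each row is a list of 0/1 values.
    """
    columns = sorted(
        {(site, aa)
         for g in genotypes
         for site, aa in g.items()
         if aa != reference[site]}
    )
    names = [f"{site}{aa}" for site, aa in columns]
    rows = [[1 if g.get(site) == aa else 0 for site, aa in columns]
            for g in genotypes]
    return names, rows

# Toy genotypes at PR positions 37 and 63 (reference residues are assumed).
reference = {37: "S", 63: "P"}
genotypes = [
    {37: "S", 63: "L"},
    {37: "E", 63: "P"},
    {37: "Y", 63: "V"},
]
names, rows = dummy_code(genotypes, reference)
```

A site with several mutant amino acids yields several columns, which is how a single position such as 63 can contribute more than one independent term to the regression.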
Structured models, such as those used in classification trees (2, 15), can be more powerful than simple logistic models where there are complex relationships between parameters. In addition, the generation of many additional variables in binary models results in overfitting. In contrast, machine learning methods, such as decision trees (16, 17), are very flexible and naturally accept multiple classifications for each variable. In addition, the models generated are explicit and can be compared with other available information. We also investigated another machine learning approach, support vector machines, but they do not readily permit explicit interpretation of the sites used in partitioning the data. We used two sequence data sets from the same samples: the first was based on PR alone and was similar to that used in the logistic regression (above) but with multiple classifications of mutant amino acids. The second data set comprised the first set plus the 55 polymorphic amino acid sites and four polymorphic insertions in p7gag plus p6gag.
Given the low frequency of HS in the data set, it was necessary to use cost-sensitive classifiers; otherwise, a misleadingly high overall prediction success rate (76%) could be obtained by misclassifying all HS cases as wild type, despite giving a sensitivity of 0%. Cost values varying between 6 and 8 for the PR data set and between 4 and 6 for the Gag-plus-PR data set were explored to find the model giving maximal sensitivity and specificity. Trees were pruned to improve generality, and the models were tested using 90-10 cross-validation (see Materials and Methods). The overall correct classification rate for the optimal PR-based model (Fig. 3A) was 72%; that for the gag-plus-PR model (Fig. 3B) was 73%. For PR sites alone, the sensitivity (correct prediction of HS) was 73% and the specificity (correct prediction of wild type) was 68% (Table 1). For the model based on Gag-plus-PR data, the sensitivity was lower (59%) but the specificity was higher (75%). This represents almost a sixfold enrichment relative to the frequency of HS in the data set.
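The reason a cost is needed can be seen with a small worked example. The sketch below is hypothetical: the counts are invented per 100 cases to reproduce the 76% figure, and the minimum-expected-cost rule in `leaf_prediction` is a generic illustration of cost-sensitive classification, not a description of the exact Weka configuration used.

```python
def accuracy_and_sensitivity(n_hs, n_wt, hs_called_hs, wt_called_wt):
    """Overall accuracy and sensitivity from counts of correct calls."""
    accuracy = (hs_called_hs + wt_called_wt) / (n_hs + n_wt)
    sensitivity = hs_called_hs / n_hs
    return accuracy, sensitivity

def leaf_prediction(n_hs, n_wt, fn_cost=1.0):
    """Choose the label with the lower expected misclassification cost.
    Calling the leaf WT costs `fn_cost` per HS case it contains;
    calling it HS costs 1 per WT case it contains."""
    return "HS" if fn_cost * n_hs > n_wt else "WT"

# Hypothetical counts: calling every case wild type looks accurate
# but detects no HS case at all.
baseline_accuracy, baseline_sensitivity = accuracy_and_sensitivity(
    n_hs=24, n_wt=76, hs_called_hs=0, wt_called_wt=76
)

# A leaf holding 3 HS and 12 WT cases flips to HS once a missed HS case
# is weighted 5 times as heavily as a false alarm.
unweighted_call = leaf_prediction(3, 12)
weighted_call = leaf_prediction(3, 12, fn_cost=5)
```

Raising the cost trades specificity for sensitivity, which is consistent with the text's search over a range of cost values for each data set.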
FIG. 3. Decision tree model describing the genetic basis for ritonavir HS in primary HIV strains. WT, wild type. (A) Based on PR alone. A cost value of 7.2 and a leaf size of 8 were used to obtain this model (see the text). (B) Based on gag plus PR. Amino acid sites in gag are preceded by g; -, amino acid deletion at the site. A cost value of 5 and a leaf size of 7 were used to obtain this model. The models are interpreted by checking the amino acid sites listed and following the prediction shown. Thus (for both), (i) if amino acid 57 is R, then information from amino acid 10 is used, while if 57 is K, then the wild type is predicted (in this data set, all 23 with this amino acid are wild type); (ii) in panel A, if amino acid 10 is L, position 37 is examined, and if 10 is I or N then the wild type is predicted (of 15 in the data set with the genotype 57K plus 10I/N, again, all are wild type).
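Reading the tree as a chain of checks can be sketched as below. Only the first two splits stated in the caption for panel A are encoded; the deeper branches appear in the figure itself and are not reproduced, so the function returns `None` for them. The function name `classify_top_of_tree` and the dictionary input format are hypothetical.

```python
def classify_top_of_tree(pr):
    """Apply the first two splits described in the Fig. 3 caption (panel A).

    pr: dict mapping PR position -> single-letter amino acid.
    Returns "WT" where the caption gives a prediction, or None where the
    deeper part of the tree (not reproduced here) would be needed.
    """
    if pr[57] == "K":
        return "WT"
    if pr[57] == "R":
        if pr[10] in ("I", "N"):
            return "WT"
        if pr[10] == "L":
            return None  # position 37 is examined next in the full tree
    return None  # residue not covered by the caption's description

calls = [
    classify_top_of_tree({57: "K", 10: "L"}),
    classify_top_of_tree({57: "R", 10: "I"}),
    classify_top_of_tree({57: "R", 10: "L"}),
]
```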
TABLE 1. Ritonavir HS model performance on 10-fold cross-validation
Prediction of HS to other PIs. The ritonavir-derived model in Fig. 3A also predicts HS to the other PIs tolerably well (Table 2), with the exception of amprenavir. The sensitivities (the true-positive rates) for saquinavir and indinavir were >60% (cf. 73% for ritonavir), while for nelfinavir it was 100%, indicating that all six nelfinavir HS cases were correctly identified. The lower sensitivity for amprenavir is not unexpected, as amprenavir has a left-shifted susceptibility distribution and nearly twice as many HS cases.
HS and low RC. We have already shown that in this data set there was a strong association between the phenotypes of low RC and HS (Fig. 2). Comparison of the predictions of HS and of low RC for each of the two data sets reveals that the ritonavir HS PR-based model had a sensitivity of 62% for low RC and a specificity of 70%, a ninefold enrichment for this phenotype. The gag-PR based model was actually slightly better (69 and 84%, respectively). We conclude that the genotypic bases of low RC and HS in strains from primary HIV infection are essentially the same. One possible reason for the association could be that it is a consequence of the assay, with strains with low RC inevitably being more readily inhibited. However, this would apply to all drugs, not just PIs, and there is no general increase in susceptibility to other ARVs in these strains (data not shown). These results suggest that the alterations in PR function associated with HS are responsible for the low RCs of these strains.
To explicitly incorporate multiple amino acids at the same site, and to permit analysis of more complex models, a decision tree approach was used. In order to test the relevance of sites in gagp6/p7, the data were analyzed with and without the presence of gag sequences. Surprisingly, it was found that inclusion of the gag region in the data set did not improve the performance of the model. Both data sets yielded models that were approximately 72% correct overall on cross-validation, and the Gag-plus-PR model had lower sensitivity on cross-validation than the PR-based model (Gag plus PR, 59%; PR alone, 73%).
One possible explanation for the lack of improvement in the model with the additional data from gag is that there were tight nonrandom associations between mutant amino acids at variable sites in gag and in PR. We therefore performed an analysis of nonrandom association among amino acids at the sites involved in the two models. This failed to identify any associations except one between amino acids 10 in PR and 471 in gag. However, given the level of variability in PR in this data set, the addition of the gag sites may not have provided any further information in the classification of the strains. The 20 amino acid sites included in the PR data set generated 161 distinct PR amino acid sequences among these 182 strains. This suggests that the PR-based model correctly identified the mechanistic basis of HS, while the additional sites from gag merely provided other material to classify the strains.
The extent of functional variation in PR, as defined by assays of susceptibility to PIs, is substantial. In this data set, from which transmitted resistant strains (>10-fold reduction in susceptibility) have been excluded, a 25-fold range in susceptibility to ritonavir was observed. A large range in RC was also observed. The relationship between these two is closest at the lowest values for each: five of the seven with susceptibilities of <0.4 had low RCs. At higher values, the two are not the same: only 4 out of the observed 13 HS cases with susceptibility values of 0.4 also had low RCs. One possible explanation for very low average RC values is the inclusion of a mixture of viable and completely inviable virus, e.g., with termination codons in coding sequences, in these samples. However, a frequency reached by a termination codon sufficient to cause the effect would be detectable in the consensus genotype, in which mixtures of >25% would become detectable as ambiguities. The frequency of ambiguities in these samples was much lower than required in this scenario (data not shown).
Despite the low measured RCs, it is clear that all of these virus strains had successfully established infections within an average of 70 days prior to being sampled (7). Clearly, they possess the basic requirements to establish an infection in a naïve host, and yet many appear to be a long way from the "optimal" RC value (the median RC for drug-susceptible strains is 70% of that of HIV-1NL4-3). In an earlier study, R0, the basic reproductive rate, in acute HIV infection was estimated to be 20, with a range from 7 to 34 (8). R0 has to be >1 to permit an infection to be self-sustaining. Assuming the laboratory-measured RC is an additive component of absolute fitness in vivo, this would suggest that the low-RC strains could have an R0 of 2, which could still permit them to establish an active infection. Thus, we conclude that the variation in RC is associated with the extreme end of a continuous spectrum of variation in fitness, to which genotypic variation in the PR sequence contributes heavily.
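The arithmetic behind the R0 estimate for low-RC strains can be made explicit. The sketch below assumes that R0 scales in proportion to RC expressed as a fraction of the reference strain, which is one simple reading of the argument and not a model stated in the text; the function name `scaled_r0` is hypothetical.

```python
def scaled_r0(r0, relative_rc):
    """Scale a baseline R0 by RC given as a fraction of the reference strain
    (proportional scaling is an assumption made for illustration)."""
    return r0 * relative_rc

# Central R0 estimate of 20 and the low-RC threshold of 0.1-fold.
low_rc_r0 = scaled_r0(20, 0.1)
self_sustaining = low_rc_r0 > 1
```

The same function can be applied to the ends of the reported R0 range (7 to 34) to see how sensitive the conclusion is to the baseline estimate.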
TABLE 2. Prediction of HS for other PIs by the ritonavir HS PR-based decision tree model
Financial support was received from the Universitywide AIDS Research Program, University of California, grant numbers PH97-SD-201 and PH97-CS-202; Center for AIDS Research grant number AI 36214; General Clinical Research Center National Center for Research Resources grant M01-RR00425 and M01-RR00102; grant numbers AI 41534, AI 43638, AI 47033, AI 47745, and TW00767 (Fogarty Center) from the National Institutes of Health; the Research Center for AIDS and HIV Infection of the San Diego Veterans Affairs Medical Center; the Canadian Institutes of Health Research and Fonds de Recherche en Sante du Quebec; and an unrestricted donation from Roche Molecular Systems.