Journal of Virology, August 2002, p. 7595-7606, Vol. 76, No. 15
0022-538X/02/$04.00+0 DOI: 10.1128/JVI.76.15.7595-7606.2002
Copyright © 2002, American Society for Microbiology. All Rights Reserved.
Algonomics NV, 9052 Ghent,1 Unit of Virology, Department of Microbiology, Institute of Tropical Medicine, 2000 Antwerp, Belgium2
Received 26 October 2001/ Accepted 26 April 2002
The mechanism of fusion of gp41 is not well understood but may be similar to fusion processes induced by conformational changes in the envelope protein hemagglutinin (6). The following model of gp41-mediated membrane fusion has been proposed (10, 51). Initially, gp41 exists in a prefusogenic conformation within the trimeric envelope glycoprotein spike. Binding of gp120/gp41 to CD4 induces initial conformational changes in gp120 that expose the coreceptor binding site, and the subsequent binding of gp120 to the coreceptor initiates the membrane fusion process itself (33, 43). Next, a transient pre-hairpin intermediate (prefusogenic state) is formed by exposure of the fusion-peptide region and concurrent formation of the N-terminal coiled-coil trimer (23). Subsequently, the N-terminal coiled coil and the C-terminal helix are assembled into a stable fusion-active (fusogenic) hairpin structure, leading to the local apposition of viral and cellular membranes (6, 50) and subsequent membrane fusion.
The folding of gp41 into its fusogenic conformation, an obligate step in virus entry into the target cell, implies that the conformational properties of both the pre-hairpin and the trimer-hairpin structures may play a critical role in driving membrane fusion. This motivates research aimed at better understanding the conversion between these structures as well as their stability. As these properties are in turn determined by the underlying amino acid sequence of e-gp41, it is important to address the structure-sequence relationship in e-gp41.
HIV-1 is characterized by an unusually high degree of genetic variability in vivo (45). HIV-1 rapidly mutates during infection, resulting in the generation of viruses that can escape immune recognition or become resistant to the drugs that are administered to the patient. To develop effective strategies against HIV, it may be necessary to target regions in the viral proteins that show a higher degree of sequence conservation than other regions. In view of the packing constraints in the triple-hairpin structure of e-gp41, this molecule may be an ideal target, and this undoubtedly explains the current focus on e-gp41 as a target for drug discovery (5, 20, 21, 32, 44).
Most information on gp41 substitutions was obtained from sequence comparison (18) and from experimental studies (31, 34, 52) addressing changes in stability and in inhibitory activity between wild-type and mutant proteins. As it may be too time-consuming to test experimentally all possible mutations in a protein, we believe it is useful to employ predictive methods aiming at reducing the number of substitutions to be evaluated experimentally.
For that purpose, we used a novel tool, referred to as the FCD generator, for computer-aided design of single-site substitutions (D. Vlieghe, C. Boutton, J. L. Verschelde, I. Lasters, and J. Desmet, submitted for publication) that is based on the recently published FASTER algorithm (17), a new powerful high-throughput algorithm for side chain placement (16). FASTER searches iteratively for the energetically most favorable conformation, the so-called Global Minimum Energy Conformation (GMEC), of an arbitrarily large collection of protein side chains positioned on a given protein backbone structure. The speed of the FASTER algorithm makes it possible not only to search for the most stabilizing conformation of the side chains but also to assess the energetic compatibility values of different amino acid types at any position throughout the protein, storing these values in a so-called Fold Compatible Database (FCD). More precisely, this database contains for each residue position in a given protein the energy cost of mutating this residue into each possible natural amino acid. These energy values are called Energy Compatible Objects (ECO) and are determined after a full relaxation of the protein environment, allowing the protein to adapt to the introduced mutation. Several methods to predict the response of a protein to point mutations have been published earlier. Some of them are just qualitative (53), and others try to be quantitative by statistical means (47) or by using known energy potentials (24). The advantage of the FCD over the other computational approaches is the fact that an all-atom physical energy function is used and that no average is taken over other protein folds, as is done in knowledge-based prediction methods. Since ECO values estimate the compatibility of an amino acid with the current protein fold, ECO values can be seen as the theoretical analogs of experimental ΔΔG observations. However, to underline that the FCD values correspond to modeling predictions, we refer to these values as ECO values and not as ΔΔG values.
In this report, we describe the use of the FCD concept to explore the sequence variation that is compatible with the HIV-1 e-gp41 triple-hairpin structure as well as the pre-hairpin structure. Starting from a reference e-gp41 structure in the Brookhaven Protein Data Bank (PDB) (3), code 1AIK (9), all possible single amino acid substitutions were generated in silico and the ECO value of each substitution with the e-gp41 scaffold was evaluated. Using the ECO values equipped with a suitable threshold parameter, we studied the correlation of our predictions with the sequence variation as observed from patient data and from a large public database. While we realize that ECO calculations based on single amino acid substitutions have inherent limitations in their predictive value, the present work follows a clear systematic, scientific path wherein, before studying specific combinations of mutations, we address to what extent the e-gp41 observed sequence variation can be explained by considering all single (independent) substitutions within the context of a reference of fixed sequence.
Genotypic and phenotypic characterization of biological clones. Starting from cell-free virus supernatant of biologically cloned virus, the RNA extractions were performed as previously described (4). Viral RNA was transcribed into DNA by using the one-tube Reverse Transcriptase kit (Titan One Tube RT-PCR kit; Roche Diagnostics, Brussels, Belgium) according to the manufacturer's recommendations. For the first round of PCR of the group M viruses, primers SQ-S2 (5' TACAGGGCTACTATTAACAAGAGA 3') and WOU29 (5' TGTAAGTCATTGGTCTTAAAGGTACCTG 3') were used. The cycle protocol was 45 min at 48°C (cDNA reaction) followed by 2 min at 94°C; 40 cycles of 30, 30, and 120 s at 94, 50, and 68°C, respectively; and one cycle of 7 min at 68°C. Nested PCR was done using the Expand High Fidelity PCR system (Roche Diagnostics) according to the manufacturer's recommendations. The primers used were H1E7169 (5' CTGGAGGAGGAGATATGAGGGACAATT 3') and WOU28_Not (5' ccgGCGGCCGCTTTGACCACTTGCCACCCAT 3'). The cycle protocol was three cycles of 60, 60, and 60 s at 94, 55, and 72°C, respectively; 32 cycles of 15, 45, and 60 s at 94, 55, and 72°C, respectively; and one cycle of 7 min at 72°C. For the first round of PCR of the group O viruses, primers O-7755S (5' GACTCTATGCACCTCCCATC 3') and A70E9047 (5' AGGGCTGCATTGTTTTGAGG 3') were used. The cycle protocol was 45 min at 48°C (cDNA reaction) followed by 2 min at 94°C; 40 cycles of 30, 30, and 60 s at 94, 50, and 68°C, respectively; and one cycle of 7 min at 68°C. The primers used for nested PCR were A70E300 (5' TGAAAGATATATGGAGAACTGA 3') and A70E8967 (5' AAAGTCGACCTGCAGAGGTGCACATGGTTCAGGCTC 3'). The cycle protocol was three cycles of 60, 60, and 60 s at 94, 55, and 72°C, respectively; 32 cycles of 15, 45, and 60 s at 94, 55, and 72°C, respectively; and one cycle of 7 min at 72°C. Sequence analysis of parts of the env/gag genes was performed to confirm the identity of the biological clones. 
Both DNA strands of a base pair fragment encoding part of the env product gp41 were sequenced. Phylogenetic analysis was performed using the TREECON software as described previously (49). Syncytium formation was determined on an MT2 cell line as described previously (2). Determination of coreceptor usage was performed as described previously using GHOST cell lines (8).
In addition to the sequences determined at the Institute of Tropical Medicine (ITM), nucleotide sequences were determined by BaseClear (Leiden, The Netherlands) by using double-stranded sequencing. Quality of the returned sequences was verified with the APES software (42), which extracts reliable nucleotide sequences from trace files generated by automated sequencers. We also used this tool to disambiguate nucleotides that were not fully resolved by BaseClear's software. Using standard alignment tools, the nucleic acid sequences were aligned and subsequently translated into the corresponding amino acid sequence in the gp41 reading frame.
Generation of compatibility data for structures of e-gp41. In this study, the three-hairpin and the pre-hairpin structures of e-gp41 were addressed. Several structures of the gp41 core fragments lacking the fusion peptide, the disulfide-bonded loop, and the membrane-spanning sequence have been solved by X-ray crystallography and nuclear magnetic resonance. All these structures correspond to the fusogenic hairpin structure. We selected, as a reference for the latter conformation, the crystal structure of HIV-1 e-gp41 with PDB code 1AIK (9) for full single-amino-acid substitution analysis. This helical complex, solved at a resolution of 2.0 Å, is a three-fold symmetrical complex wherein each unit is composed of the peptides N36 (amino acids 546 to 581; residues are numbered according to their position in gp160) and C34 (amino acids 628 to 661). As no crystal structure is available for the pre-hairpin state, we chose to take the triple coiled-coil N36-core structure of 1AIK as a model for this intermediate conformation, since the N and C domains are exposed in this open structure. Of course, such a model is necessarily limited to one part (the N helices). To emphasize that this model lacks the C helices, we refer to this model as the N-core structure.
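The size of the substitution space implied by the N36 and C34 residue ranges can be sketched as follows. This is only an illustrative enumeration, not the FCD generator itself; the names `fcd_slots`, `N36_POSITIONS`, and `C34_POSITIONS` are hypothetical, and the count is per hairpin unit.

```python
# Residue ranges from the text, numbered by position in gp160.
N36_POSITIONS = range(546, 582)  # 546..581 inclusive (36 residues)
C34_POSITIONS = range(628, 662)  # 628..661 inclusive (34 residues)

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # one-letter codes of the 20 natural amino acids


def fcd_slots(positions, alphabet=AMINO_ACIDS):
    """Yield every (position, amino_acid) pair; the wild type is included."""
    for pos in positions:
        for aa in alphabet:
            yield pos, aa


positions = list(N36_POSITIONS) + list(C34_POSITIONS)
slots = list(fcd_slots(positions))
# Each position has one wild-type residue, so 19 true substitutions remain.
n_substitutions = len(positions) * (len(AMINO_ACIDS) - 1)

print(len(positions), len(slots), n_substitutions)  # 70 1400 1330
```

The 70 positions obtained here match the denominator used later when the fraction of mutated positions is reported.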
Using our FCD algorithm (D. Vlieghe, C. Boutton, J. L. Verschelde, I. Lasters, and J. Desmet, submitted for publication), which is based on our recently published FASTER paper (17), we computed for both states of gp41 the energetic compatibility (ECO) of all naturally occurring amino acids at each position in the structures. The ECO is defined as the difference between the global energy of the reference structure and the global energy of the point-mutated protein. Under this definition, at any position, the wild-type (wt) amino acid (from the reference structure) is characterized by a zero ECO value. Negative or slightly positive ECO values correspond to amino acid substitutions that are expected to be energetically compatible with the given protein fold. Conversely, for amino acid substitutions marked by higher positive ECO values, i.e., ECO values beyond a certain ECO threshold, one would expect that these would be incompatible with the underlying scaffold. The energy function used is the CHARMm force field as is the standard used in the Brugel package (14) supplemented with additional terms to account for solvation effects (D. Vlieghe, C. Boutton, J. L. Verschelde, I. Lasters and J. Desmet, submitted for publication). Taking into account the three-fold symmetry relation between the hairpin units, the structure of e-gp41 is systematically substituted by side chain replacements and side chain optimizations but the backbone conformation is assumed to be constant during the optimization process. To account for some limited main-chain flexibility, a set of perturbed backbone conformations is generated, clustered around the reference structure. These perturbed backbone conformations are prepared during a 100-ps restrained molecular dynamics simulation of the original structure, from which 50 snapshots are taken, followed by a restrained minimization procedure using the Brugel modeling program (14). 
The restraining forces are applied on the distances between two atoms by using a multiplication factor of 2.5 kcal/Å and the steepest descent minimization is terminated after 10,000 iteration steps or when the root mean square of the forces is below 0.02 kcal/mol/Å. Hence, each ECO is represented by a collection of 51 energy values, of which the minimum (minECO) is used to judge whether the gp41 protein scaffold is apt to tolerate a given amino acid type at a given residue position. The FCD algorithm operates on an SGI (IRIX 6.5) machine, taking a total of about 30 h to complete one FCD generation for the N-core structure and about 110 h for the triple-hairpin structure.
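The reduction of the per-backbone energy values to a single minECO, followed by the threshold test, can be sketched as below. The function names and the toy numbers are hypothetical. The sketch assumes that larger ECO values mean a less favorable substitution, as implied by the statement that high positive ECO values indicate incompatibility, and it uses "less than or equal to" for the threshold test.

```python
def min_eco(eco_values):
    """Return the most favourable (lowest) ECO over all backbone conformations."""
    if not eco_values:
        raise ValueError("at least one ECO value is required")
    return min(eco_values)


def is_fold_compatible(eco_values, threshold_kcal_mol):
    """True when the best-case ECO does not exceed the chosen threshold."""
    return min_eco(eco_values) <= threshold_kcal_mol


# Toy values (kcal/mol), not taken from the paper: one ECO for the reference
# backbone plus a few perturbed backbones (the paper uses 50 of the latter).
toy_ecos = [7.4, 6.1, 2.8, 9.0, 4.3]

print(min_eco(toy_ecos))                  # 2.8
print(is_fold_compatible(toy_ecos, 3.0))  # True
print(is_fold_compatible(toy_ecos, 2.0))  # False
```

Taking the minimum means that a substitution is accepted as soon as one of the sampled backbone conformations can accommodate it.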
Relative entropy as measure of information content calculated at each position in HIV-1.
Relative entropy calculations are useful for identifying patterns in biological sequences (19) and are used here as a way of measuring the amino acid conservation at each position in e-gp41. At each position (pos), the probability $P_{pos}(i)$ of each of the 20 amino acids (i) is calculated by using the Boltzmann equation (Eq. 1), where kT = 1 and $E_i^{pos}$ denotes the minECO recorded in the FCD for amino acid i at position pos:
$$P_{pos}(i) = \frac{e^{-E_i^{pos}/kT}}{\sum_{j=1}^{20} e^{-E_j^{pos}/kT}} \qquad (1)$$
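Given the probabilities $P_{pos}(i)$, the relative entropy $H_{pos}$ (Eq. 2) (19) is defined as follows:
$$H_{pos} = \sum_{i=1}^{20} P_{pos}(i) \log \frac{P_{pos}(i)}{Q(i)} \qquad (2)$$
where $Q(i)$ denotes the reference (background) probability of amino acid i.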
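Equations 1 and 2 can be illustrated numerically with the sketch below. Two choices here are assumptions that the text does not specify: the background distribution is taken to be uniform over the amino acids supplied, and the logarithm is taken in base 2. The function names and the toy minECO values are hypothetical.

```python
import math


def boltzmann_probabilities(min_ecos, kT=1.0):
    """Map {amino_acid: minECO} to Boltzmann probabilities (Eq. 1)."""
    lowest = min(min_ecos.values())
    # Subtracting the lowest energy leaves the ratios unchanged and avoids overflow.
    weights = {aa: math.exp(-(e - lowest) / kT) for aa, e in min_ecos.items()}
    total = sum(weights.values())
    return {aa: w / total for aa, w in weights.items()}


def relative_entropy(p, q=None):
    """Relative entropy in bits of p with respect to q (uniform q if omitted)."""
    if q is None:
        q = {aa: 1.0 / len(p) for aa in p}  # assumed uniform background
    return sum(pi * math.log2(pi / q[aa]) for aa, pi in p.items() if pi > 0.0)


AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Toy position where every amino acid is equally compatible.
flat = boltzmann_probabilities({aa: 0.0 for aa in AMINO_ACIDS})
# Toy position where only one amino acid is tolerated.
strict = boltzmann_probabilities({aa: (0.0 if aa == "L" else 50.0) for aa in AMINO_ACIDS})

print(relative_entropy(flat))    # close to 0 bits: no conservation
print(relative_entropy(strict))  # close to log2(20) bits: full conservation
```

Under these assumptions the relative entropy ranges from 0 for a fully permissive position to log2(20), about 4.32 bits, for a position that tolerates a single amino acid.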
Nucleotide sequence accession numbers. The HIV-1 gp41 nucleotide sequence data were deposited in the EMBL, GenBank, and DDBJ nucleotide sequence databases under the following accession numbers: AJ427989 to AJ428023.
Genotypic and phenotypic characterization. For the genetic verification of the obtained biological clones, at least one clone derived from each of the primary and laboratory isolates was examined through either sequence analysis or a heteroduplex mobility assay (15). For all clones, genetic analysis was focused on the env gene except for the clones obtained from VI525 and VI526, where parts of both the env and gag genes were analyzed. The genetic subtype of the biological clones was compared to the genetic subtype of the original primary isolates with respect to the same region in that gene.
The subtypes of the env gene coding for part of gp41 for the biological clones that are listed in Table 1 were determined by phylogenetic analysis. The subtyping of the biological clones was done according to preexisting env subtype information as reported for the various primary and laboratory isolates. Phylogenetic analysis also revealed high homology between the original isolates and their derived clone(s) (data not shown). However, for the biological clones derived from VI525 and VI526, discordance in subtypes was found. Although VI525 and VI526 were originally subtyped as G in the env gene and subtype H in the gag gene (35, 36), we found other results. In total, 6 biological clones were derived from VI525 and 12 biological clones were derived from VI526. For VI525, only one clone was subtyped as G for the env and H in the gag gene, just as for the original primary isolate, while five out of six clones were subtyped A for both the env and gag genes, indicating a mixed infection. For VI526, 3 out of 12 clones were subtyped G for env and A for the gag gene, 8 out of 12 were subtyped A for both the env and gag genes, and 1 out of 12 was subtyped A for the gag gene and remained unclassified for the env gene.
TABLE 1. Set of nonredundant patient sequences of infectious HIV-1 e-gp41 clones
Sets of nonredundant e-gp41 amino acid sequences. For 35 HIV-1-infected clones derived from HIV-seropositive patients, the gp41 fragment was sequenced both in house and by BaseClear (Leiden, The Netherlands). Using standard sequence alignment methods and guided by visual inspection of the alignment, the N-peptide and C-peptide DNA regions were identified and subsequently translated into amino acid sequences by using the gp41 reading frame. For a few clones, the alignment showed insertions. Since these could not be handled by our current modeling tools, those sequences were discarded. Finally, we applied a redundancy filter at the level of the obtained amino acid sequences. This filter ensures that only unique sequences are retained and is used to avoid bias in the analysis of the prediction scores. Table 1 shows the alignment of the resulting sequence data set, referred to below as the "patient sequence set." This set contains 25 N and 33 C nonredundant amino acid sequences. The table also lists the origin of the patient from which each HIV-1 isolate was obtained; the majority (about 70%) of the patients are of African origin. Table 1 also includes the 1AIK sequence, used as a reference in this study. This reference sequence, subtyped B for the env gene, most closely resembles the European sequences and the other group M subtypes. The fact that the nonredundant set contains more C sequences than N sequences suggests that the C helix, which in the triple-hairpin structure surrounds the N core, is marked by a higher sequence diversity and concomitantly by a larger number of substitutions per sequence, as illustrated in Fig. 1.
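A redundancy filter of the kind described above can be sketched in a few lines. This is a minimal illustration under the assumption that two sequences are redundant only when their amino acid strings are identical; the function name, clone names, and peptide strings are hypothetical.

```python
def nonredundant(records):
    """Keep the first occurrence of each distinct amino acid sequence."""
    seen = set()
    unique = []
    for name, sequence in records:
        if sequence not in seen:
            seen.add(sequence)
            unique.append((name, sequence))
    return unique


# Hypothetical clone names and toy peptide fragments.
records = [
    ("clone_1", "QLLSGIV"),
    ("clone_2", "QLLSGIV"),  # identical to clone_1, so it is dropped
    ("clone_3", "QLLSGIA"),
]

print(nonredundant(records))
# [('clone_1', 'QLLSGIV'), ('clone_3', 'QLLSGIA')]
```

Applying the filter separately to the N and C regions would explain why the two regions can end up with different numbers of retained sequences.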
FIG. 1. Frequency of sequences found in patient sequence set as function of number of substitutions per sequence for N sequences (A) and C sequences (B). The origin of the patient from which the HIV-1 isolate was obtained is indicated: E.U, Europe; U.S, United States; A.F, Africa.
As the outcome of the retrospective analysis was dependent on the quality of the experimental set, it was crucial to work with sequences that were expected to be highly reliable. For this purpose, we defined a third set comprising, in addition to the patient sequence set, all sequences that were found at least two times in the blast search. The latter criterion is based on the universally accepted principle that independently observed and thus reproducible data are more accurate. However, it is noted that sequences not selected by this criterion are not necessarily bad data. This set will be referred to as the "validated sequence set" and contains 236 nonredundant peptide sequences partitioned in 68 N and 168 C sequences.
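The selection rule for the validated sequence set can be sketched as follows, assuming that the database hits are available as a flat list in which a sequence appears once per independent observation. The function name `validated_set` and the toy sequences are hypothetical.

```python
from collections import Counter


def validated_set(patient_sequences, search_hits, min_count=2):
    """Patient sequences plus every database hit seen at least min_count times."""
    counts = Counter(search_hits)
    recurring = {seq for seq, n in counts.items() if n >= min_count}
    return set(patient_sequences) | recurring


# Toy three-residue "sequences" for illustration only.
patient = ["AAA", "AAC"]
hits = ["AAG", "AAG", "AAT", "AAC"]

print(sorted(validated_set(patient, hits)))  # ['AAA', 'AAC', 'AAG']
```

In this toy case "AAT" is left out because it was observed only once, which does not by itself mean that the sequence is wrong.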
In the patient sequences, 53% (37 out of 70) of amino acid positions are mutated, resulting in a total of 83 different amino acid substitutions. If only the validated sequence set is taken into account, 69% (48 out of 70) of the positions are mutated at least once, totaling 152 different amino acid substitutions. Considering all the sequences, 93% (65 out of 70) of the positions are mutated at least once, totaling 308 different amino acid substitutions.
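Counts of this kind can be derived from an alignment as sketched below. The sketch assumes gap-free sequences of the same length as the reference, which is consistent with the exclusion of sequences containing insertions. The function name and the toy sequences are hypothetical.

```python
def observed_substitutions(reference, sequences, first_position=1):
    """Return the set of distinct (position, wild_type, variant) substitutions."""
    substitutions = set()
    for seq in sequences:
        if len(seq) != len(reference):
            raise ValueError("sequences must be aligned to the reference without gaps")
        for offset, (wt, aa) in enumerate(zip(reference, seq)):
            if aa != wt:
                substitutions.add((first_position + offset, wt, aa))
    return substitutions


# Toy data: a four-residue reference and four aligned variants.
reference = "ACDE"
variants = ["ACDE", "AGDE", "AGDK", "ATDE"]

subs = observed_substitutions(reference, variants)
mutated_positions = {pos for pos, _, _ in subs}

print(len(subs), len(mutated_positions))                # 3 2
print(round(100 * len(mutated_positions) / len(reference)))  # 50
```

With the figures quoted above, the same calculation gives 37/70, 48/70, and 65/70, that is, 53%, 69%, and 93% of positions mutated.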
Correlation between predicted and observed sequence variation. The different variants of e-gp41 from the patient, validated, and full-sequence sets were correlated with the predicted sets of compatible mutations derived from the FCDs of the triple-hairpin and the N-core structures. This analysis involves the usage of a threshold parameter on the compatibility (minECO) values. All amino acid substitutions having a minECO lower than a chosen threshold were considered to be compatible with the underlying scaffold. For both forms of e-gp41, the percentages of observed substitutions for the three sequence sets that were predicted to be fold compatible by the FCD continuously increased when higher threshold values (1 to 5 kcal/mol) were chosen (Fig. 2A and B). As the threshold was raised from 1 to 5 kcal/mol, more and more amino acid variation was found to be compatible with the underlying scaffold, as is shown in Table 2 for the FCDs of both the triple-hairpin and N core structures. Evidently, as the threshold rises, the FCD is bound to become more permissive, tolerating more sequence variation. In the limit of an infinite threshold, the FCD is fully permissive and any amino acid change would be qualified as scaffold compatible. For any minECO threshold, we define the permissiveness of the FCD as the fraction of amino acid changes in the FCD having a minECO value smaller than or equal to the given minECO threshold. To assess to what extent the observed amino acid variation is specifically explained by the FCD, we introduce the notion of preference factor. At any minECO threshold, the preference factor is defined as the ratio between the number of observed substitutions that are in agreement with FCD values and the expected number of these substitutions that would be explained by the FCD just in view of the permissiveness of the FCD. Clearly, at an infinite minECO threshold, the preference factor is necessarily unity. 
Although more of the ECOs are considered compatible with the current fold at higher minECO thresholds, the preference factor, which is measured relative to a random situation, is still significantly higher than would be expected from the FCD permissiveness alone (Fig. 3), suggesting that the FCD is capable of recognizing the natural sequence variation that is compatible with the e-gp41 structures. For minECO thresholds higher than 5 kcal/mol, the prediction scores start saturating while the preference factor monotonically decreases to 1 (data not shown). This suggests that for minECO thresholds higher than 5 kcal/mol, we gradually move towards a situation wherein the FCD loses specificity. For example, at a minECO threshold of 15 kcal/mol, all prediction scores are 100% with a preference factor of 1, meaning that predictions at this high threshold are the necessary consequence of the full permissiveness of the FCD at such a high minECO threshold.
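The definitions of permissiveness and preference factor given above can be written out as a short sketch. The FCD is represented here as a hypothetical dictionary mapping (position, amino acid) to minECO for non-wild-type amino acids only; all values are toy numbers, and the function names are not taken from the original software.

```python
def permissiveness(fcd, threshold):
    """Fraction of all FCD substitutions with minECO <= threshold."""
    return sum(e <= threshold for e in fcd.values()) / len(fcd)


def preference_factor(fcd, observed, threshold):
    """Observed compatible substitutions divided by the number expected by chance."""
    n_compatible = sum(fcd[sub] <= threshold for sub in observed)
    expected = len(observed) * permissiveness(fcd, threshold)
    if expected == 0:
        raise ValueError("FCD is fully restrictive at this threshold")
    return n_compatible / expected


# Toy FCD: (position, amino_acid) -> minECO in kcal/mol.
toy_fcd = {
    (1, "A"): -1.0, (1, "G"): 0.5, (1, "W"): 12.0,
    (2, "L"): 2.0, (2, "P"): 20.0, (2, "V"): 4.5,
    (3, "D"): 8.0, (3, "E"): 1.5, (3, "K"): 6.0, (3, "R"): 30.0,
}
toy_observed = [(1, "A"), (2, "L"), (3, "E"), (3, "K")]

print(permissiveness(toy_fcd, 3.0))                            # 0.4
print(preference_factor(toy_fcd, toy_observed, 3.0))           # about 1.875
print(preference_factor(toy_fcd, toy_observed, float("inf")))  # 1.0
```

In the toy example, 3 of the 4 observed substitutions are compatible at 3 kcal/mol, whereas only 4 x 0.4 = 1.6 would be expected from permissiveness alone, and the factor falls to 1 when the threshold is unbounded.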
FIG. 2. Percentage of observed substitutions for three sequence sets that were predicted to be fold compatible by FCD. (A) Triple-hairpin structure. (B) N core. (C) Percentage of expected substitutions at thresholds of 2 and 3 kcal/mol, considering a set the same size as the patient sequence set but randomly sampled from the full-sequence set for the N helices.
TABLE 2. Amino acid substitution compatibility
FIG. 3. Preference factors computed at various minECO threshold levels (x axis) for the triple-hairpin (A) and N-core (B) structures in the patient sequence, validated-sequence, and full-sequence sets. The preference factor, defined in the text, describes to what extent the observed sequence variation is specifically explained by the FCD.
Comparison of predicted and observed sequence variations in patient sequence set. The set corresponding to a minECO value of 5 kcal/mol was compared with the patient sequence set. Out of the 83 substitutions, 74 (89%) were FCD compatible with the trimeric hairpin structure of 1AIK and 9 (11%) were predicted to be destabilizing (Table 3). With regard to the N-helix part of the trimeric hairpin structure, it was found that 17 out of 23 (74%) of the substitutions were FCD compatible, whereas 57 out of 60 (95%) of the C-helix substitutions were FCD compatible. Also, at lower minECO thresholds, it was observed that the C-helix substitutions were more FCD compatible than the N-helix substitutions (data not shown).
TABLE 3. Substitutions present in infectious sequences
Two of the badly predicted substitutions according to our criteria can be considered borderline cases, with ECO values of 5.1 and 5.02 kcal/mol for L565M and Y638I, respectively. Most of the other badly predicted mutants appear to correlate with HIV isolates that are highly variable in sequence compared to our reference sequence. For example, the variants VI526_2, ANT70_1, VI686-1, and CA9_4, containing the A561T and/or the Q577R substitutions in the N sequence (Table 1), also contain other substitutions that are spatially close but located in their related C sequences (Table 1; the C sequence of VI526_2 is identical to VI525_1). The inconsistency with FCD predictions at these positions of the N helix packing against residues of the C helix could be attributed to correlated mutations between the N and C helices. This result suggests that a more pronounced rearrangement of the protein main chain may be necessary to account for all these multiple substitutions. Clearly, as such rearrangements are not encompassed by the current FCD, some inconsistencies with the FCD may arise when analyzing the sequence variation for some of the sequences that show many substitutions with respect to the reference sequence.
Prediction score as function of sequence distance. Figure 4 shows, for the sequences of the full-sequence set, the percentages of residues that are compatible with the FCD for the triple-hairpin structure (minECO threshold of 3 kcal/mol) as a function of the distance between each of the sequences and the reference sequence taken from 1AIK. This distance corresponds to the number of substitutions relative to the reference sequence. As expected, the largest distances were observed for the C helices. Interestingly, in the distance regime where the prediction score for the N helices dropped significantly (distance > 12), the scores remained very high for the C helices, indicating that the C helices were more permissive to amino acid variation than the N helix, which is buried within the triple-hairpin structure.
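The two quantities plotted against each other can be sketched as below. The sketch assumes gap-free alignment to the reference and treats a residue identical to the reference as compatible, since the wild type has an ECO of zero by definition. The function names, the toy FCD, and the toy sequences are hypothetical.

```python
def distance(reference, seq):
    """Number of substituted positions relative to the reference."""
    return sum(a != b for a, b in zip(reference, seq))


def percent_compatible(reference, seq, fcd, threshold, first_position=1):
    """Percentage of residues that are wild type or have minECO <= threshold."""
    compatible = 0
    for offset, (wt, aa) in enumerate(zip(reference, seq)):
        if aa == wt or fcd[(first_position + offset, aa)] <= threshold:
            compatible += 1
    return 100.0 * compatible / len(reference)


# Toy reference, variant, and minECO values (kcal/mol).
reference = "ACDE"
variant = "AGDK"
toy_fcd = {(2, "G"): 1.0, (4, "K"): 6.0}

print(distance(reference, variant))                          # 2
print(percent_compatible(reference, variant, toy_fcd, 3.0))  # 75.0
```

Because unchanged residues count as compatible, the score can only fall as the distance grows, which is why a score that stays high at large distances is informative.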
FIG. 4. Percentage of residues compatible with FCD for triple-hairpin structure as a function of the distance between each of the sequences (full set) and reference sequence 1AIK. This distance corresponds to the number of substitutions relative to the 1AIK sequence. The minECO threshold used was 3 kcal/mol.
Recently, it was found that the T586I substitution in SIV e-gp41 strongly stabilizes the trimer of hairpins (30). In HIV, the corresponding position is residue I573, which is involved in the N-N interface. Interestingly, all our FCDs showed that Thr at this position would be destabilizing. To verify whether the FCD can successfully predict the scaffold compatibility for T586I in the SIV e-gp41 context, we generated, for the slightly asymmetrical 2SIV structure (37), an FCD for the T586I substitution by the same procedure that was followed for the generation of the HIV e-gp41 FCDs. The minECO for the T586I substitution is strongly negative (-8 kcal/mol), in agreement with the experimental observation that the SIV T586I substitution is strongly stabilizing (30).
Determination of regions permissive and conservative to mutagenesis. From the FCD, we can determine regions in the triple hairpin that are permissive and less permissive to mutagenesis. For each position, we counted the number of predicted mutations by the FCD for a minECO threshold of 3 kcal/mol (Fig. 5). The N-helix positions 547, 549, 551, 555 to 557, 559, 565 to 566, 568, 571-573, 575 to 576, and 579 and the C-helix positions 628, 631, 635, 642, 645, 649, and 656 all showed fewer than two predicted substitutions and hence were considered conservative. On the other hand, a position may be considered very permissive if more than 10 different amino acid substitutions are predicted. This is applicable to the N-helix positions 546, 550, 553, 560, 563, 564, 567, 577, 578, and 581 and C-helix positions 629, 633, 634, 636, 637, 639 to 641, 643, 644, 647, 650, 651, 654, 655, and 657 to 661. All other positions have intermediate permissiveness.
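The per-position classification can be sketched as follows. The default cutoffs follow the paragraph above (fewer than two predicted substitutions for conservative positions, more than 10 for very permissive ones) and are left as parameters so that other cutoffs can be tried. The function names and the toy FCD are hypothetical.

```python
def count_predicted(fcd, threshold):
    """Count, per position, the substitutions with minECO <= threshold."""
    counts = {}
    for (pos, _aa), eco in fcd.items():
        counts.setdefault(pos, 0)
        if eco <= threshold:
            counts[pos] += 1
    return counts


def classify(n_predicted, conservative_below=2, permissive_above=10):
    """Assign a permissiveness class from the number of predicted substitutions."""
    if n_predicted < conservative_below:
        return "conservative"
    if n_predicted > permissive_above:
        return "very permissive"
    return "intermediate"


# Toy FCD with two hypothetical positions; values in kcal/mol.
toy_fcd = {
    (1, "A"): 1.0, (1, "G"): 2.5, (1, "S"): 0.2,
    (2, "A"): 9.0, (2, "G"): 12.0,
}

counts = count_predicted(toy_fcd, 3.0)
print(counts)                                              # {1: 3, 2: 0}
print({pos: classify(n) for pos, n in counts.items()})
# {1: 'intermediate', 2: 'conservative'}
```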
FIG. 5. Number of predicted amino acids at each position for minECO threshold of 3 kcal/mol. The classes of permissiveness are defined by thresholds, indicated by the dashed horizontal lines. Very permissive regions (number of predicted substitutions higher than or equal to 10) are marked by black bars. White bars represent the conserved regions (number of predicted amino acid substitutions is lower than 3).
TABLE 4. Predicted FCD-compatible substitutions
FIG. 6. Relative entropy plot computed on FCD of triple hairpin of 1AIK. The arrows highlight the residues in the cavity. The numbers superimposed on this plot correspond to the number of different amino acid types observed in the patient sequence set.
A small fraction (11% at a minECO threshold of 5 kcal/mol) of the sequence variation was not in agreement with FCD predictions of the triple-hairpin structure. However, in these cases, the sequences were generally highly variable compared to our reference sequence. We compared the FCD predictions for the validated-sequence set for different groups of residues according to their packing interactions. The percentages of predicted substitutions were the lowest for residues of the N helices involved in N-C interfaces (Fig. 7). In principle, this decreased score could be attributed to correlated multiple mutations between the N and C helices. To test the hypothesis of correlated mutations, we generated, starting from 1AIK, two mutated structures. One contained the single Q577R substitution in the N regions. The second one contained the double mutation Q577R-K574R in the N regions and the double mutation M629Q-E634Q in the C regions, corresponding to the variants VI686-1, ANT70_1, and CA9_4. Comparing the minimized energies of these two mutated structures (-922 kcal/mol for the Q577R mutant and -879 kcal/mol for the structure carrying both double mutations) to that of 1AIK (-977 kcal/mol), we saw that both mutated structures were less stable than the wild type, suggesting that the three extra mutations (K574R, M629Q, and E634Q) did not compensate for the predicted destabilizing effect of the single Q577R mutation in 1AIK. Furthermore, this analysis suggests that the N and C helices may be packed in some flexible way, allowing e-gp41 to accommodate some of the highly substituted sequences. This hypothesis is supported by the comparison of the structures of SIV (6, 9, 37, 50) and Visna Virus (38) with HIV. If the N-terminal coiled-coil cores are superimposed, the C peptides are shifted by more than 2 Å along the groove, resulting in a reorientation of the C peptides relative to the inner N core. 
Such adjustments are not modeled in our current FCD version that operates on a set of slightly perturbed structures not containing the linking loop between the N and C regions.
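The energy comparison for the two mutated structures amounts to the simple arithmetic sketched below, using the minimized energies quoted in the text. The sign convention (mutant minus reference, so that positive values mean destabilization) and the dictionary labels are assumptions made for illustration.

```python
E_REFERENCE = -977.0  # minimized energy of 1AIK in kcal/mol, from the text

# Minimized energies of the two mutated structures (kcal/mol), from the text.
minimized_energy = {
    "Q577R": -922.0,
    "Q577R+K574R+M629Q+E634Q": -879.0,
}


def energy_change(e_mutant, e_reference=E_REFERENCE):
    """Positive values mean the mutant is less stable than the reference."""
    return e_mutant - e_reference


for name, energy in minimized_energy.items():
    print(name, energy_change(energy))
# Q577R 55.0
# Q577R+K574R+M629Q+E634Q 98.0
```

Under this convention the three additional substitutions raise the energy by a further 43 kcal/mol instead of lowering it, which is the basis for concluding that they do not compensate for Q577R on a fixed 1AIK scaffold.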
FIG. 7. Percentage of observed substitutions for the validated sequence set that are predicted to be fold compatible by the FCD in the trimer of the hairpin structure. The residues are partitioned into the following groups: residues involved in the N-N (A) and N-C (B) interfaces (10, 51), N-helix residues not implied in such interfaces (C), residues of the C helices (10, 51) (D) and the other residues (E), buried residues (ASA 25 Å) (F), and those exposed to solvent (G). The minECO threshold used was 5 kcal/mol.
Interestingly, the FCD predictions in the N-C peptide complex of the N helices are better correlated with observed group M subtype sequences. Since our reference scaffold 1AIK belongs to group M subtype B, it can be inferred that there is a high level of structural conservation in the N domain of the different group M subtypes. In contrast, the group O N helices may, in view of their more pronounced sequence distance relative to the 1AIK sequence, undergo structural adaptations in the triple-hairpin conformation (relative to group M) to maintain the packing interactions between the N and C peptides (6, 10). To accommodate the sequence differences, the packing arrangement between the N and C helices might be somewhat different between the M and O clades. This hypothesis is supported by the dissimilar crossing angles found between the inner N helix and outer C helix of SIV compared to those of HIV-1 (6, 9, 37, 50).
Comparison of predictions for three sequence data sets. Fig. 2 suggests that FCD predictions are in better agreement with sequence data that correspond to gp41 variants of well-validated sequences than with sequence variation taken from a large database lacking such rigorous characterization. The results from the random sampling analysis (Fig. 2C) (taking random sets from the full-sequence set that were the same size as the patient sequence set) suggest that the difference in size between the two data sets cannot fully explain the difference in score. This view is also corroborated by the preference factors in Fig. 3, which are systematically the highest for the patient and the validated-sequence sets. This analysis suggests that some of the sequences in the public databases (i.e., those occurring only once) may correspond to noninfectious e-gp41 variants archived in the course of routine sequencing work. This hypothesis is supported by the higher score (78%) of predicted FCD-compatible substitutions at a threshold of 5 kcal/mol when all substitutions that occur only once are excluded from the full-sequence set, compared to 70% if all sequences are taken (Fig. 2A).
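To make the singleton-exclusion comparison concrete, the following Python sketch shows one way such a score could be computed. It assumes that a substitution counts as fold compatible when its minECO value is at or below the threshold, and that each distinct observed substitution is counted once; neither assumption is confirmed by the text. The records in `toy` are invented placeholders used only to exercise the function, not data from this study, and `compatibility_score` is a hypothetical helper name.

```python
from typing import NamedTuple


class Substitution(NamedTuple):
    label: str        # placeholder identifier for a substitution
    min_eco: float    # predicted minECO in kcal/mol (placeholder values)
    occurrences: int  # number of database sequences carrying it


def compatibility_score(subs, threshold=5.0, min_occurrences=1):
    """Percentage of distinct substitutions, seen at least
    `min_occurrences` times, whose minECO is <= `threshold`."""
    kept = [s for s in subs if s.occurrences >= min_occurrences]
    if not kept:
        raise ValueError("no substitutions left after filtering")
    compatible = sum(1 for s in kept if s.min_eco <= threshold)
    return 100.0 * compatible / len(kept)


# Invented toy records (NOT study data): two recurrent, three singletons.
toy = [
    Substitution("sub_a", 1.2, 14),
    Substitution("sub_b", 4.8, 3),
    Substitution("sub_c", 9.5, 1),
    Substitution("sub_d", 0.3, 1),
    Substitution("sub_e", 7.1, 1),
]

score_all = compatibility_score(toy)                          # every substitution
score_recurrent = compatibility_score(toy, min_occurrences=2)  # singletons dropped
print(score_all, score_recurrent)
```

In this toy set the score rises when singletons are dropped because most of the incompatible entries occur only once, which is the pattern the hypothesis above would predict.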
Comparison between N-core and triple-hairpin FCDs. The FCD for N-core e-gp41 is apparently more compatible with the N-helix sequence variation than the FCD for the triple-hairpin structure: about 74 and 100% of the sequence variation in the N helices can be explained by the triple-hairpin and N-core FCDs, respectively. To judge the meaning of these results, it is useful to complement these scores with the corresponding preference factors. For minECO thresholds of 1, 2, 3, 4, and 5 kcal/mol, the preference factors for the N-helix sequence variation (taken from the patient sequence set) in the context of the triple-hairpin structure are 2.78, 2.58, 2.5, 2.4, and 1.96, respectively. These values are much higher than the corresponding preference factors determined for the N-core FCD (1.56, 1.58, 1.73, 1.52, and 1.53) (Fig. 3B). We suggest that this again indicates that the triple-hairpin structure imposes more constraints on N-helix sequence variation than contexts in which the N helix is more solvent exposed, as is possibly the case in the pre-hairpin structure.
This view is also supported when only negative FCD values are considered (corresponding to single-amino-acid substitutions that are predicted to be preferred over the reference sequence). For the triple-hairpin FCD, 13 and 38% of the possible substitutions in the N helices and C helices, respectively, have a negative minECO value. Interestingly, 23 and 37% of the sequence variation observed for the N helix and the C helix, respectively, in the full-sequence set matches negative minECO values. For the C helix, the percentage of possible substitutions with negative minECO values (38%) almost exactly matches the FCD-explained sequence variation (37%); we therefore hypothesize that there may not be strong pressure on the C helix to select for sequence variation restricted to the region of negative ECO values (enhanced stability). Such pressure may well apply to the N helix, as considerably more sequence variation (23%) is explained by the FCD than would be expected from the fraction of negative minECO values (13%). Moreover, for the N-core FCD, the fraction of explained N-helix sequence variation at negative minECO (33%) is also considerably higher than the fraction of negative minECO values (23%). Hence, the sequence pressure inferred above may also apply to the pre-hairpin form of e-gp41 and may reflect an intrinsic characteristic of the N helix, which is involved in specific packing interactions with neighboring N helices to form a trimeric coiled-coil structure.
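The argument above compares, for each helix, the share of observed variation that falls at negative minECO values with the share of all possible substitutions that do so. A short Python sketch of that comparison follows, using only the percentages reported in the text. The ratio computed here is an illustrative enrichment measure introduced for this example; it is not claimed to be identical to the preference factor shown in Fig. 3.

```python
# Each entry: (observed variation at negative minECO, in %,
#              possible substitutions with negative minECO, in %),
# taken from the percentages reported in the text.
REPORTED = {
    "N helix, triple-hairpin FCD": (23.0, 13.0),
    "C helix, triple-hairpin FCD": (37.0, 38.0),
    "N helix, N-core FCD": (33.0, 23.0),
}


def enrichment(observed_pct, possible_pct):
    """Ratio > 1: observed variation favors negative-minECO substitutions
    more than expected if substitutions were drawn uniformly."""
    if possible_pct <= 0:
        raise ValueError("possible_pct must be positive")
    return observed_pct / possible_pct


ratios = {name: enrichment(obs, poss) for name, (obs, poss) in REPORTED.items()}
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}")
```

A ratio close to 1 for the C helix and ratios clearly above 1 for the N helix in both structural contexts are what the text interprets as selective pressure acting mainly on the N helix.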
This higher pressure on sequence conservation should be explored in drug discovery programs targeting gp41. More specifically, we believe that the FCD will be of great practical use in the design of proteins in which well-balanced sequence variation, based on FCD compatibility values, is engineered across multiple residues of e-gp41.
The FCD concept appears to be an efficient tool for restricting the number of substitutions that must be tested experimentally. It can be used to search for substitutions in the triple-hairpin structure that are stabilizing or destabilizing (e.g., that favor or disfavor the triple-hairpin structure over the pre-hairpin structure). The reduction will of course depend on the ECO threshold used (i.e., the stringency level). If, for example, we wanted to engineer substitutions that are expected to markedly stabilize the triple-hairpin structure, we could use a low minECO threshold of, say, -2 kcal/mol. This would yield 84 candidate substitutions out of a total of 1,440 single-amino-acid substitutions in the triple-hairpin structure, reducing by 94% the number of substitutions that would have to be evaluated in a brute-force approach. For future work, we propose applying the FCD concept to identify a limited set of substitutions for engineering pre-hairpin e-gp41 structural variants for use in drug screening programs.
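The size of the reduction quoted above follows directly from the two counts. As a quick check, the Python sketch below recomputes it; `screening_reduction` is a hypothetical helper name, and only the counts 84 and 1,440 come from the text.

```python
def screening_reduction(n_candidates, n_total):
    """Percentage of substitutions that no longer need to be tested
    when only `n_candidates` out of `n_total` pass the threshold."""
    if n_total <= 0 or not 0 <= n_candidates <= n_total:
        raise ValueError("need 0 <= n_candidates <= n_total and n_total > 0")
    return 100.0 * (1.0 - n_candidates / n_total)


# Counts quoted in the text for a minECO threshold of -2 kcal/mol.
reduction = screening_reduction(84, 1440)
print(f"{reduction:.1f}% fewer substitutions to evaluate")
```

The 84 candidates represent just under 6% of all single-amino-acid substitutions, which corresponds to the 94% reduction stated above.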
In conclusion, although we worked with a prediction method developed for single-point mutations, the natural sequence variability of e-gp41 can be explained very well. This suggests that the e-gp41 scaffold can accommodate a large variety of sequences while remaining structurally intact, thereby not jeopardizing the key role that e-gp41 plays in viral uptake by the target cell.