Journal of Virology, August 2007, p. 8543-8551, Vol. 81, No. 16
0022-538X/07/$08.00+0 doi:10.1128/JVI.00463-07
Copyright © 2007, American Society for Microbiology. All Rights Reserved.

Martine Peeters,4
Ricardo Camacho,2
Beth Shapiro,3
Andrew Rambaut,3,
and
Anne-Mieke Vandamme1*
Laboratory for Clinical and Epidemiological Virology, Rega Institute for Medical Research, Katholieke Universiteit Leuven, Minderbroedersstraat 10, B-3000 Leuven, Belgium,1 Laboratório de Virologia, Serviço de Imunohemoterapia, Hospital de Egas Moniz, Rua da Junqueira, 126, 1349-019 Lisbon, Portugal,2 Evolutionary Biology Group, Department of Zoology, University of Oxford, South Parks Road, Oxford OX1 3PS, United Kingdom,3 Laboratory Retrovirus, IRD, IRD-UMR 145, 911, Av. Agropolis-BP 64501, 34394 Montpellier Cedex 5, France4
Received 5 March 2007/ Accepted 29 May 2007
HIV-1 exhibits very high genetic diversity and is classified in three major groups (M, N, and O). Group M, which is responsible for the global HIV-1 pandemic, is further classified into subtypes A, B, C, D, F, G, H, J, and K, each representing distinctive lineages within group M (26). These subtypes have diversified independently following the initial transmission of the HIV-1 group M progenitor to humans. Chance exportation of particular lineages from the initial epidemic region, followed by subsequent local epidemics in previously uninfected regions, likely led to the current global distribution of HIV-1 subtypes (23, 24). Subsubtypes (e.g., A1, A2, A3, and A4 and F1 and F2) are distinctive lineages that are not genetically distant enough to justify designation as a new subtype, and circulating recombinant forms (CRFs) are intersubtype recombinant viruses with a significant epidemic spread (26). According to the Los Alamos National Laboratory database, 34 CRFs are currently characterized, eight of which are mosaic genomes containing gene regions of more than two subtypes (http://www.hiv.lanl.gov/content/index). The database also lists a large number of unique recombinant forms, generated after coinfection or superinfection in a patient with two different subtypes.
Since the different subtypes are assumed to have evolved independently, different genome regions are expected to have the same evolutionary history. However, this is apparently not the case for subtype G. The original study describing subtype G reported that some genomic regions within the subtype had greater similarity with subtype A than would be expected for a pure subtype (2). Several subsequent analyses discussed a putative recombinant origin of subtype G (26); however, no previously reported pure subtype could be assigned to those fragments that did not show genetic similarity to subtype A. Hence, it was decided to keep subtype G as a pure subtype in the classification system (7, 8, 26).
One of the interesting features of the CRF02_AG epidemiology is that it was already considerably prevalent early in the pandemic, which is not expected for a recombinant strain. By 1999, it was already more prevalent than its supposed parental subtype G lineage in its putative region of origin, West Central Africa (1, 2).
In this study, we have tested the validity of the current classification of subtype G and CRF02_AG. Given the possible recombinant history of subtype G and the geographical distributions of both subtype G and CRF02_AG, we hypothesize that subtype G is not a pure subtype but is instead a CRF, with the proposed CRF CRF02_AG as a parental lineage. To test this hypothesis, we performed an extensive analysis of the group M phylogeny, using full-genome sequences from all the currently identified pure subtypes and CRFs. Our results have important implications for understanding the geographical epidemiological history of HIV-1 and raise questions about the current classification of HIV-1 subtypes.
Sequencing of new subtype J full-length genome. A previously unpublished subtype J full-length genome (KTB147) was included in the alignment described above. Partial gag and gp160 sequences have been previously reported (35), and the full-length sequence was obtained from overlapping PCR fragments spanning different genome regions, as previously described (18, 19, 33, 35).
Recombination analysis. To explore putative recombination patterns in the sequences, we used a sliding window approach, which computes a statistic or measure for a successive set of overlapping subregions (windows) of the alignment (28). This allows identification of the putative intersubtype recombination breakpoints of the query sequence by the graphical detection of a change in the phylogenetic signal, which is characterized by a sudden decrease in support for the clustering with a certain subtype and the simultaneous increase of support for the clustering with another subtype. The query sequence is analyzed against the previously described pure subtype reference set. The software Simplot v3.5.1 (15) was initially used to perform similarity and bootscanning analyses of a query sequence against a set of other sequences. The similarity plot measures the similarity/dissimilarity of the query sequence to a set of reference sequences. In the bootscanning plot, the phylogenetic relationship between the query sequence and the reference set is calculated using bootstrap resampling and the bootstrap values are plotted along the genome. In the preliminary analysis, a window size of 500 bp and step size of 100 bp were used, while in the final plots the window size and step size were 350 bp and 50 bp, respectively. This procedure made it possible to maximize the detection of recombination events while maintaining a good phylogenetic signal and, by comparison of the plots, to ensure that the two window sizes generated similar results. In addition, a more rigorous sliding window analysis using a Bayesian phylogenetic approach was performed, to increase our confidence in the inferred recombination breakpoints. Sliding windows of 500 bp, moving in 100-bp steps, were generated using the software SlidingBayes0.94 (21). As this analysis is extremely time-consuming, only one strain of each subtype was included. 
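The windowing scheme described above can be sketched as follows. This is a minimal illustration only, not the algorithm used by Simplot or SlidingBayes: the function names are hypothetical, and similarity is computed as raw identity over ungapped columns, whereas the published plots rely on distance models, bootstrap resampling, and Bayesian inference.

```python
def sliding_windows(alignment_length, window=350, step=50):
    """Yield (start, end) half-open column ranges along the alignment.

    Defaults follow the final plots in the text (350-bp window, 50-bp step).
    """
    for start in range(0, alignment_length - window + 1, step):
        yield start, start + window


def similarity(a, b):
    """Fraction of identical columns, ignoring columns with a gap ('-').

    Assumption: plain identity, chosen here only to keep the sketch short.
    """
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return sum(x == y for x, y in pairs) / len(pairs)


def similarity_scan(query, references, window=350, step=50):
    """Return [(window_midpoint, {reference_name: similarity})].

    `query` and every sequence in `references` must come from the same
    alignment, so that column positions are comparable.
    """
    scan = []
    for start, end in sliding_windows(len(query), window, step):
        scores = {
            name: similarity(query[start:end], seq[start:end])
            for name, seq in references.items()
        }
        scan.append(((start + end) // 2, scores))
    return scan


# Toy alignment: the query matches reference "A" in its first half and
# reference "B" in its second half, mimicking a single breakpoint.
toy_refs = {"A": "A" * 20, "B": "C" * 20}
toy_query = "A" * 10 + "C" * 10
for midpoint, scores in similarity_scan(toy_query, toy_refs, window=4, step=2):
    print(midpoint, scores)
```

In the toy scan, support shifts from reference "A" to reference "B" around the middle of the alignment, which is the kind of crossover that marks a putative breakpoint in the plots.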
Each window was analyzed phylogenetically using Bayesian inference as implemented in MrBayes v3.1.2 (27). In this analysis, two Monte Carlo Markov chains (MCMCs) are run simultaneously for the number of generations needed for a stationary distribution to be maintained long enough after convergence. Typically, the number of generations was 4 × 10^6 to 5 × 10^6, with the initial 10% of these generations discarded as burn-in. To analyze convergence and stability, we used the software Tracer1.3 (A. Rambaut and A. J. Drummond, http://evolve.zoo.ox.ac.uk/), which allowed us to visualize the posterior distribution for each parameter and provided an estimate of the effective sample size, a measure of the number of "effectively independent" samples in each run, as defined by Drummond et al. (4). We also analyzed the convergence using diagnostic measures implemented in MrBayes, in particular the potential scale reduction factor, as defined by Gelman and Rubin (9). We considered a run to have converged when the effective sample size of all parameters was above 100 and when the potential scale reduction factor was approximately 1. We also ensured that the log likelihood reached stability after the burn-in period, which was discarded from the sample. Posterior probabilities for the clustering of the query with the reference strains were plotted along the genome.
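The convergence criteria can be illustrated with a short sketch. It is not the MrBayes or Tracer implementation: the function names are hypothetical, the potential scale reduction factor follows the basic Gelman and Rubin form, effective sample sizes are taken as given (for example, as reported by Tracer), and the tolerance used for "approximately 1" is an assumption.

```python
from math import sqrt
from statistics import mean, variance


def discard_burn_in(samples, percent=10):
    """Drop the first `percent` of samples (10% in the text) as burn-in."""
    cut = len(samples) * percent // 100
    return samples[cut:]


def psrf(chains):
    """Basic Gelman-Rubin potential scale reduction factor for one parameter.

    `chains` holds one list of sampled values per independent MCMC run,
    all of equal length and with burn-in already removed.
    """
    n = len(chains[0])
    if len(chains) < 2 or n < 2 or any(len(c) != n for c in chains):
        raise ValueError("need at least two chains of equal length >= 2")
    within = mean([variance(c) for c in chains])          # W
    between_over_n = variance([mean(c) for c in chains])  # B / n
    if within == 0:
        raise ValueError("chains show no within-run variance")
    pooled = (n - 1) / n * within + between_over_n
    return sqrt(pooled / within)


def looks_converged(ess_values, psrf_values, ess_min=100, psrf_tol=0.05):
    """Apply the two criteria from the text.

    ess_min=100 comes from the text; psrf_tol is an assumed reading of
    "approximately 1", not a value stated by the authors.
    """
    return (all(e > ess_min for e in ess_values)
            and all(abs(r - 1.0) <= psrf_tol for r in psrf_values))


# Two toy runs that sample the same distribution in a different order.
run1 = discard_burn_in([i % 5 for i in range(1000)])
run2 = discard_burn_in([(i + 2) % 5 for i in range(1000)])
print(psrf([run1, run2]))
```

Runs that explore different regions of parameter space inflate the between-run term and push the factor well above 1, which is why values close to 1 are read as agreement between the two runs.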
The putative recombination breakpoints suggested by the similarity and bootscanning plots and by the posterior probability plots were similar but not identical. Therefore, we used the informative site analysis as implemented in Simplot v3.5.1 to get a more precise breakpoint estimate. Neighbor-joining (NJ) and maximum-likelihood (ML) trees were reconstructed for each of the fragments defined by the recombination breakpoints. The parameters of the evolutionary model were estimated from the data. Finally, 1,000 bootstrap replicates and the zero branch length test were performed to assess the robustness of the clustering. The software PAUP 4.0b10 (30) was used to produce the NJ and ML trees. Phylogenetic analysis was also performed using Bayesian inference, as described above.
Monophyly rules for subtype G and CRF02_AG to discriminate parent from recombinant. A group of sequences is called monophyletic if they form a cluster composed of all descendants from an inferred common ancestor (parent). If a group of sequences do not include all descendants of their inferred most recent common ancestor (MRCA), then those sequences cluster as paraphyletic; they cannot be grouped in a single cluster. In the context of HIV-1 molecular epidemiology, we can expect that the parental subtype will have an MRCA more ancient than that of the CRF originating from it. Therefore, we can expect that the parent pure subtype will be paraphyletic with respect to the CRF, which will cluster monophyletically within the pure subtype cluster.
Within genetic regions where CRF02_AG is currently considered to be of subtype G origin, the parent can be discriminated from the recombinant by investigating their sequence divergence, using the reasoning explained in the previous paragraph. For this purpose, the last 10,000 trees of the posterior distribution of trees generated by each MCMC run, summarizing the phylogenetic uncertainty, were midpoint rooted and the support for all of the following three "monophyly rules," concerning the CRF02_AG/G cluster, was investigated (Fig. 1): (i) monophyly of CRF02_AG plus G, (ii) monophyly of CRF02_AG separately, and (iii) monophyly of subtype G separately.
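The three monophyly rules can be sketched on a toy rooted tree. This is an illustrative sketch under stated assumptions: trees are represented as nested tuples with tip names as strings, rooting (midpoint rooting in the text) is assumed to have been done already, and the function and tip names are hypothetical rather than taken from the authors' scripts.

```python
def clades(tree):
    """Return the tip set of every clade (subtree) of a rooted tree.

    A tree is a nested tuple; a tip is a string.
    """
    found = []

    def visit(node):
        if isinstance(node, str):
            tips = frozenset([node])
        else:
            tips = frozenset().union(*(visit(child) for child in node))
        found.append(tips)
        return tips

    visit(tree)
    return found


def is_monophyletic(tree, tips):
    """True if some clade of the tree contains exactly these tips."""
    return frozenset(tips) in clades(tree)


def monophyly_rules(tree, crf02_ag, subtype_g):
    """Evaluate rules (i), (ii), and (iii) from the text for one tree."""
    return (
        is_monophyletic(tree, set(crf02_ag) | set(subtype_g)),  # (i) both
        is_monophyletic(tree, crf02_ag),                        # (ii) CRF02_AG
        is_monophyletic(tree, subtype_g),                       # (iii) G
    )


ag = ["AG1", "AG2"]
g = ["G1", "G2"]

# Subtype G nested inside CRF02_AG: CRF02_AG is paraphyletic.
g_inside_ag = (("AG1", ("AG2", ("G1", "G2"))), "H1")
# CRF02_AG nested inside subtype G: subtype G is paraphyletic.
ag_inside_g = (("G1", ("G2", ("AG1", "AG2"))), "H1")
# Two separate sister clusters: both are monophyletic.
separate = ((("G1", "G2"), ("AG1", "AG2")), "H1")

print(monophyly_rules(g_inside_ag, ag, g))  # (True, False, True)
print(monophyly_rules(ag_inside_g, ag, g))  # (True, True, False)
print(monophyly_rules(separate, ag, g))     # (True, True, True)
```

Following the reasoning of the previous paragraph, the lineage that fails its own monophyly rule while the combined cluster holds is the candidate parent, and a tree in which all three rules hold does not discriminate between the two scenarios.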
FIG. 1. Schematic putative phylogenetic trees of our data set and its classification regarding the monophyly rules defined in Materials and Methods. Rule 1, monophyly of CRF02_AG plus G; rule 2, monophyly of CRF02_AG separately; rule 3, monophyly of subtype G separately. If our hypothesis is confirmed, our output trees should show the pattern of panel b.
Nucleotide sequence accession number. The new subtype J sequence (KTB147) was submitted to the GenBank database and assigned accession number EF614151.
FIG. 2. Recombination analysis of subtype G strains compared to all other pure subtype strains. (a) Similarity (top), bootscanning (middle), and sliding Bayes (bottom) analysis done as described in Materials and Methods, with the gene regions indicated on top and the recombination breakpoints as determined by informative site analysis. (b) ML tree of the genome region between bp 4316 and 5162 as indicated in panel a. (c) ML tree of the genome region between bp 5577 and 6083 as indicated in panel a. The genomic regions illustrated in the tree are indicated in the upper panel. ML trees were generated with PAUP v4b10, as described in Materials and Methods. , midpoint root of the tree; *, zero branch length test with P < 0.001 and NJ bootstrap support of >70; #, zero branch length test with P < 0.001 but NJ bootstrap support of <70.
FIG. 3. Recombination analysis of CRF02_AG strains. Similarity (top), bootscanning (middle), and sliding Bayes (bottom) analysis done as described in Materials and Methods, using as subtype reference sequences all pure subtypes including subtype G (a) and all pure subtypes excluding subtype G (b). The recombinant structure as defined in the Los Alamos database is shown on top. The region indicated corresponds to the nonrecombinant region analyzed in the final Bayesian tree (Fig. 4). LTR, long terminal repeat.
There is, however, a small region (bp 1650 to 2350) where CRF02_AG is not closely related to subtype A. Since ML phylogenetic analysis showed no evidence of CRF02_AG being derived from subtype G, as these two groups formed two separated monophyletic clusters (data not shown), this fragment in CRF02_AG may have been derived from another source.
Investigating whether CRF02_AG or subtype G is the parent of the common fragments. Based on the results above, we performed a scanning analysis in which subtype G was used as a query sequence and the reference sequence set used CRF02_AG as the representative of subtype A (Fig. 4a). The resulting plot suggested a pattern of recombination between CRF02_AG, subtype H, and subtype J, which is confirmed by phylogenetic analysis (Fig. 4b and 2c).
FIG. 4. Recombination analysis of subtype G strains compared to all other pure subtype strains and CRF02_AG (considering CRF02_AG as a putative pure subtype representative of subtype A). (a) Similarity (top), bootscanning (middle), and sliding Bayes (bottom) analysis done as described in Materials and Methods and at the top the proposed recombinant structure. (b) ML tree of the merged genome regions bp 1500 to 2325, 3275 to 5475, and 7275 to 7975 as indicated in panel a. ML trees were generated with PAUP v4b10, as described in Materials and Methods. The phylogenetic tree of the J region is shown in Fig. 2c. , midpoint root of the tree; *, zero branch length test, P < 0.001 and NJ bootstrap support of >70.
If subtype G is an A/J recombinant and CRF02_AG is a recombinant of subtype G and subtype A, then the subtype G fragment is the parent of the CRF02_AG fragment and can therefore be expected to be more diverse, with CRF02_AG clustering within subtype G (CRF02_AG monophyletic and subtype G paraphyletic with respect to CRF02_AG). If the alternative hypothesis is true, the opposite scenario is expected (subtype G monophyletic and CRF02_AG paraphyletic with respect to subtype G) (see definitions of monophyly and paraphyly in Materials and Methods). Bayesian inference with MrBayes (27) showed that the second hypothesis was true: CRF02_AG strains were paraphyletic with respect to the monophyletic subtype G strains, indicating that subtype G arose as a separate lineage from the CRF02_AG diversity and not the other way round (Fig. 5).
FIG. 5. Phylogenetic analysis to discriminate the parent from the recombinant in the genome region bp 3500 to 4000. The Bayesian tree shown was one of the trees generated by MrBayes in one of two independent MCMC runs. The support of the clustering of CRF02_AG and subtype G was analyzed using the "monophyly rules" described in Materials and Methods. The paraphyletic clade (here CRF02_AG) can be considered the parent, and the monophyletic clade (here subtype G) can be considered the recombinant.
To assess the validity of the finding in the integrase region, we used a statistical analysis that records the percentage of posterior trees that had this particular paraphyletic relationship of the CRF02_AG-subtype G cluster through the investigation of three "monophyly rules": monophyletic clustering of CRF02_AG plus subtype G, monophyletic clustering of CRF02_AG alone, and monophyletic clustering of subtype G alone (see Materials and Methods for details). Of trees resulting from both MCMC runs, 99.9% fulfilled the rules concordant with CRF02_AG being the parent of subtype G, with only 10 trees in each run (0.1% per run) showing topologies in which CRF02_AG and "subtype G" formed separated clusters. In these trees, our hypothesis was not confirmed, but it was also not contradicted, since we found separate monophyletic clusters for the two lineages. Therefore, none of the trees resulting from either MCMC run suggested that subtype G was the parent of CRF02_AG.
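The tally behind these percentages can be sketched as below, assuming each posterior tree has already been reduced to a triple of booleans for the three monophyly rules (combined cluster, CRF02_AG alone, subtype G alone). The function name is hypothetical, and the counts are reconstructed from the figures quoted in the text (10 of the last 10,000 trees per run), not taken from the authors' output files.

```python
from collections import Counter


def rule_pattern_percentages(rule_triples):
    """Return {pattern: percentage of trees} for a list of rule triples.

    Each triple is (rule_i, rule_ii, rule_iii) for one posterior tree.
    """
    counts = Counter(rule_triples)
    total = len(rule_triples)
    return {pattern: 100 * n / total for pattern, n in counts.items()}


# Pattern expected when CRF02_AG is the parent: the combined cluster and
# subtype G are monophyletic, CRF02_AG is not (it is paraphyletic).
CRF02_AG_PARENT = (True, False, True)
# Pattern for two separate monophyletic clusters (uninformative).
SEPARATE_CLUSTERS = (True, True, True)

# One run as described in the text: 10,000 sampled trees, 10 of which
# show separate clusters.
one_run = [CRF02_AG_PARENT] * 9990 + [SEPARATE_CLUSTERS] * 10
percentages = rule_pattern_percentages(one_run)
print(percentages[CRF02_AG_PARENT])    # 99.9
print(percentages[SEPARATE_CLUSTERS])  # 0.1
```

No tree in this tally carries the pattern in which subtype G is paraphyletic and CRF02_AG is monophyletic, which is the pattern that would have pointed to subtype G as the parent.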
Recombination pattern of "subtype G." Some regions of the "subtype G" genome could not be assigned to either CRF02_AG or subtype J, and for these regions, we hypothesize a recombinant origin from a putative full-length subtype G (similar to what is assumed for CRF01_AE). "Subtype G" could thus be considered an AGJ recombinant, indicated as in Fig. 4a.
In our analyses, we included all published full-genome sequences. However, our failure to identify the parental strains of some regions of the subtype G genome suggests that pieces are still missing in the puzzle. Indeed, some of the parental strains may have gone extinct or are as yet undiscovered. We will probably never know the full genetic diversity of HIV at the time of the origin of either CRF02_AG or subtype G. However, our analysis convincingly shows that the current circulating CRF02_AG strains are paraphyletic to the current circulating subtype G strains, so there is no doubt that, for example, for the integrase gene providing the strongest statistical support, the MRCA of the current CRF02_AG strains is ancestral to the MRCA of the current subtype G strains (Fig. 5), indicating that a CRF02_AG-related virus was the parent of the integrase in this recombinant "subtype" G.
Recombination complicates the analysis of the evolutionary history of organisms, as different genomic regions will give discordant results. Here, we show that the high recombination rates observed for HIV can indeed mislead the interpretation of its evolutionary history. Biological interpretations based on the recombinant or nonrecombinant origin of strains should therefore be made with great caution. An example of interpretation based on recombination signal is the current interest in the biological significance of recombination hotspots (16). In such analyses, caution should be taken when assigning the parental strains of the putative recombinants, as the erroneous assignment of parental strains may give rise to misleading results. This is applicable to all viruses known to have high recombination rates and is especially important since most methods for detecting recombination depend on an initial assumption of parental strains.
Finally, our findings urge a reassessment of the HIV-1 evolutionary history. Further detailed analyses will be needed to verify whether the entire notion of "subtype" and "recombinant" applies to HIV-1. As current phylogenetic methods are not capable of accurately reconstructing the evolutionary histories of highly recombinant sequences, it may never be possible to correctly assign for all strains which one is the recombinant and which one is the parent.
Published ahead of print on 6 June 2007.
Present address: HRC Pathogen Bioinformatics Unit, South African National Bioinformatics Institute, University of the Western Cape, Cape Town, South Africa.
Present address: Institute of Evolutionary Biology, University of Edinburgh, Edinburgh, United Kingdom.