Journal of Virology, December 2007, p. 13050-13056, Vol. 81, No. 23
0022-538X/07/$08.00+0 doi:10.1128/JVI.00889-07
Copyright © 2007, American Society for Microbiology. All Rights Reserved.

Tulio de Oliveira,3,4,
Andrew Rambaut,5
Oliver G. Pybus,3
David Dunn,6
Anne-Mieke Vandamme,7
Paul Kellam,1
Deenan Pillay,1,8 on Behalf of the UK Collaborative Group on HIV Drug Resistance
Department of Infection, University College London, London, United Kingdom,1 Division of Infectious Diseases, Stanford University, Stanford, California,2 Department of Zoology, University of Oxford, Oxford, United Kingdom,3 The South African National Bioinformatics Institute, University of the Western Cape, Bellville, South Africa,4 Institute of Evolutionary Biology, University of Edinburgh, Edinburgh, Scotland,5 Medical Research Council Clinical Trials Unit, London, United Kingdom,6 Rega Institute for Medical Research, K.U. Leuven, Belgium,7 Centres for Infection, Health Protection Agency, Colindale, United Kingdom8
Received 25 April 2007/ Accepted 17 September 2007
Founder effects accompanying the spread of HIV-1 infection have generated an uneven distribution of M group strains among geographic areas and exposure risk populations (27). Consequently, strains often exhibit specific associations with particular geographic regions and/or modes of transmission (8, 18, 20, 32). Tracking these dynamic associations through surveillance of genetic diversity will facilitate epidemiological investigations and inform public health strategies for the prevention of viral spread (5, 16, 18, 43).
The introduction of resistance test sequencing as a standard component of clinical care in many countries provides an abundant source of routinely generated sequence data for the HIV-1 pol gene. Such wide-scale generation of HIV sequence data enables us to use phylogenetic approaches to investigate epidemiological hypotheses that cannot be tackled using nongenetic surveillance data alone (16). In particular, it is essential to determine which HIV strains are being introduced into countries, whether these strains are spreading through ongoing transmission, and if so, through what exposure risk.
In a recent analysis of pol gene sequence data representing approximately one-fifth of all United Kingdom infections, we revealed the extensive range of HIV-1 genetic diversity present within the United Kingdom (12). Here, we use phylogenetic methods to explore this diversity in depth, showing how epidemiological information can be extracted from sequence data sets of this type. In addition, our analysis provides new, fine-scale information about the geographic structuring of the global HIV-1 epidemic.
In the United Kingdom, viral sequences may be obtained for resistance testing prior to initiating antiretroviral therapy, as well as in response to treatment failure. For all patients in this study, samples taken prior to initiating therapy were preferentially used. Where no pretreatment sample was available, the earliest posttreatment sample was used. In total, 2,821 sequences (50%) were obtained from patients reported as antiretroviral naive at the time of sampling, 2,750 were from patients with some previous treatment history, and 110 were from cases whose treatment history was unknown. The prevalence of drug resistance mutations in the data set was determined using the calculated population resistance (CPR) tool (cpr.stanford.edu/cpr/). In summary, 2,219 sequences (38.2%) had one or more surveillance drug resistance mutations (29). The prevalence of surveillance drug resistance mutations was highly skewed towards mutations at six positions (41, 67, 70, 103, 184, and 215 in RT). Mutations at 12 other positions occurred at a 2 to 9% prevalence (positions 46, 54, 82, 88, and 90 in PR and 70, 74, 101, 181, 190, 210, and 219 in RT). Surveillance drug resistance mutations at other positions occurred at a <2.0% prevalence.
Strain-level classification of HIV-1 diversity. Nine hundred seventy-nine complete-genome sequences sampled from distinct infections and annotated by country of sampling were used to derive a fine-scale classification of global HIV-1 genetic diversity that reflected geographic associations. Sequences were obtained from the Los Alamos HIV-1 sequence database (www.hiv.lanl.gov) and aligned using a combination of automated protocols and manual adjustment. This data set included representatives of all "pure" HIV-1 M group subtypes (A to D, F to H, J to K) and circulating recombinant forms (CRFs) 01 to 14, 18, and 19. Alignments were used to construct neighbor-joining (NJ) phylogenies with gamma rate distributions and the Hasegawa-Kishino-Yano evolutionary model implemented in PAUP (33). We defined epidemiologically distinct "strains" as monophyletic clusters within established subtype/CRF groupings that comprised two or more sequences sharing the same geographic origin, as defined by country of sampling (countries were grouped into geographic regions according to the classification used by UNAIDS [14]).
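The strain definition above (monophyletic clusters of two or more sequences sharing a geographic origin) can be illustrated with a minimal sketch. It assumes a rooted tree held as nested tuples of tip names and a dictionary mapping each tip to a region; the function name `regional_clusters`, the data layout, and the example labels are hypothetical, and the sketch ignores the additional requirement that clusters fall within an established subtype/CRF grouping.

```python
from typing import Dict, List, Tuple, Union

# A tree is either a tip name or a tuple of subtrees (hypothetical layout).
Tree = Union[str, Tuple["Tree", ...]]


def regional_clusters(tree: Tree, region_of: Dict[str, str],
                      min_size: int = 2) -> List[Tuple[str, List[str]]]:
    """Return the largest clades whose tips all share one region."""
    clusters: List[Tuple[str, List[str]]] = []

    def visit(node):
        # Returns (tips, region); region is None when the clade is mixed.
        if isinstance(node, str):
            return [node], region_of[node]
        results = [visit(child) for child in node]
        tips = [t for child_tips, _ in results for t in child_tips]
        regions = {region for _, region in results}
        if len(regions) == 1 and None not in regions:
            return tips, regions.pop()
        # This clade is mixed, so each uniform child clade is maximal.
        for child_tips, region in results:
            if region is not None and len(child_tips) >= min_size:
                clusters.append((region, sorted(child_tips)))
        return tips, None

    tips, region = visit(tree)
    if region is not None and len(tips) >= min_size:
        clusters.append((region, sorted(tips)))
    return clusters


# Illustrative tree and region labels (not taken from the study data).
example_tree = ((("a1", "a2"), "b1"), ("c1", ("c2", "c3")))
example_regions = {"a1": "East Africa", "a2": "East Africa",
                   "b1": "West Africa", "c1": "South America",
                   "c2": "South America", "c3": "South America"}
example_clusters = regional_clusters(example_tree, example_regions)
```

In this toy example the single West African tip is not reported, because a strain requires at least two sequences from the same region.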
The inclusion of recombinant sequences within the data set meant that phylogenetic reconstruction could misrepresent evolutionary relationships, which would be correctly represented by a network rather than a tree. However, the aim here was not to accurately reconstruct the deeper evolutionary history of HIV-1 M group lineages but rather to distinguish more-recent lineages comprising distinct groups of closely related genomes sharing specific geographic associations. To confirm that groups identified in complete-genome phylogenies could be recovered using subgenomic regions, bootstrapped phylogenies (1,000 replicates) were reconstructed using the gag, pol, and env genes and a concatenated region of the alignment representing a minimum-length resistance test sequence (codons 1 to 99 of PR and 40 to 240 of RT).
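The bootstrap step can be sketched in general terms as follows. This is an illustration of nonparametric bootstrapping as commonly applied to alignments, not the PAUP procedure used in the study: alignment columns are resampled with replacement, a tree is rebuilt from each pseudoreplicate by some external method (not shown), and support for a clade is the fraction of replicate trees that contain it. The function names and the toy alignment are hypothetical.

```python
import random
from typing import Dict, FrozenSet, Iterable, List


def resample_columns(alignment: Dict[str, str],
                     rng: random.Random) -> Dict[str, str]:
    """Draw alignment columns with replacement (one pseudoreplicate)."""
    length = len(next(iter(alignment.values())))
    columns = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(seq[i] for i in columns)
            for name, seq in alignment.items()}


def clade_support(clade: FrozenSet[str],
                  replicate_clades: Iterable[Iterable[FrozenSet[str]]]) -> float:
    """Fraction of replicate trees (given as sets of clades) containing `clade`."""
    replicates: List[set] = [set(clades) for clades in replicate_clades]
    hits = sum(1 for clades in replicates if clade in clades)
    return hits / len(replicates)


# Toy alignment; in practice each pseudoreplicate would be passed to a
# tree-building program, and the clades of each resulting tree collected.
toy_alignment = {"s1": "ACGTAC", "s2": "ACGTTC", "s3": "TCGAAC"}
pseudoreplicate = resample_columns(toy_alignment, random.Random(1))
```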
Batch genotyping of HIV-1 pol gene sequences. HIV-1 pol gene sequences were assigned to genotypic groups using a modified version of the Rega automated genotyping tool (http://www.bioafrica.net/subtypingclusters.html). With the Rega tool, each query sequence is analyzed separately with a set of reference sequences representing the various genotypic groups for the sequence under analysis and with a series of phylogenetic procedures incorporating bootstrapped NJ analysis, bootscanning, and likelihood mapping-based analysis of phylogenetic content (4). Assignment of sequences to genotypic groups is based on phylogenetic criteria, including bootstrap support values and the topological relationships of query sequences relative to reference sequences (see reference 4 for details). Sequences that do not fulfill criteria for confident assignment to any group represent poorly characterized diversity and are designated unclassifiable.
Ninety-six pol gene reference sequences (including two or more representatives of each global strain) were selected from the original 976 complete-genome sequences used for strain definition. Sequences were classified first according to established subtype/CRF definitions (in this case, assignment to CRFs required identification of recombination breakpoints in the pol region analyzed) and second according to the fine-scale, strain-level classification that we had defined by analysis of complete-genome sequences (see above).
Investigation of transmission dynamics. Phylogenetic analysis can indicate whether imported HIV-1 strains are spreading through ongoing local transmission. If strains are not spreading locally, then in a phylogeny representing both (i) sequences from imported strains and (ii) reference sequences representative of HIV-1 genetic diversity in the putative region(s) of original infection, we expect that imported sequences would be randomly distributed among reference sequences, reflecting separate importation events. If, however, we observe larger monophyletic clusters of imported sequences, this suggests either the presence of a local transmission chain or an importation process heavily biased towards closely related strains.
A reference data set comprising 3,201 globally sampled pol gene sequences (annotated by country of sampling) was obtained from the Los Alamos HIV sequence database. These sequences were classified by batch genotyping in Rega using the strain-level classification described above. Next, an NJ phylogeny was constructed using a combined United Kingdom and global pol sequence data set. For each strain represented in the phylogeny, the statistical significance of clustering among United Kingdom sequences relative to global reference sequences was assessed using a method based on the phylogenetic principle of parsimony (Slatkin-Maddison test [31]). Given a phylogeny in which each tip has been assigned a state (in this case, sampling within or outside the United Kingdom), the parsimony algorithm can be used to estimate the minimum number of state changes needed to give rise to the observed distribution of states across a given region of the phylogeny (22). For clusters comprising five or more United Kingdom sequences, states were randomized among all sequences of the same strain, and for each randomization, the minimum number of state changes within that clade was calculated using parsimony. The total number of changes was summed across all 100 randomizations and divided by the number of replicates, giving the expected number of changes under the null hypothesis of a random distribution of sequences among countries. The difference between the observed and expected number of changes calculated in this way indicated the significance of clustering.
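The randomization procedure can be sketched as follows, assuming a strictly bifurcating rooted tree stored as nested tuples and a two-state tip labeling. Fitch parsimony is used here to count the minimum number of state changes; the function names are hypothetical, and the empirical P value (the fraction of randomizations with no more changes than observed) is an added convenience that the text above does not describe.

```python
import random
from typing import Dict, List, Tuple, Union

# A tree is a tip name or a pair of subtrees (bifurcating trees only).
Tree = Union[str, Tuple["Tree", "Tree"]]


def tip_names(tree: Tree) -> List[str]:
    if isinstance(tree, str):
        return [tree]
    return [name for child in tree for name in tip_names(child)]


def fitch_changes(tree: Tree, state_of: Dict[str, str]) -> int:
    """Minimum number of state changes on the tree (Fitch parsimony)."""
    changes = 0

    def visit(node):
        nonlocal changes
        if isinstance(node, str):
            return {state_of[node]}
        left, right = visit(node[0]), visit(node[1])
        common = left & right
        if common:
            return common
        changes += 1  # Disjoint child state sets imply one change here.
        return left | right

    visit(tree)
    return changes


def clustering_test(tree: Tree, state_of: Dict[str, str],
                    n_random: int = 100, seed: int = 0):
    """Compare observed changes with the mean over shuffled tip states."""
    rng = random.Random(seed)
    names = tip_names(tree)
    observed = fitch_changes(tree, state_of)
    states = [state_of[name] for name in names]
    null_counts = []
    for _ in range(n_random):
        rng.shuffle(states)  # Randomize states among the tips.
        null_counts.append(fitch_changes(tree, dict(zip(names, states))))
    expected = sum(null_counts) / n_random
    # Assumption: one-sided empirical P value, not part of the source text.
    p_value = sum(1 for c in null_counts if c <= observed) / n_random
    return observed, expected, p_value


# Toy tree in which the four "UK" tips form a single clade.
toy_tree = ((("u1", "u2"), ("u3", "u4")), (("g1", "g2"), ("g3", "g4")))
toy_states = {name: ("UK" if name.startswith("u") else "global")
              for name in tip_names(toy_tree)}
observed, expected, p_value = clustering_test(toy_tree, toy_states)
```

Fewer observed changes than expected indicates that United Kingdom sequences cluster more tightly than random placement among reference sequences would produce.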
Investigation of the epidemic history of transmission clusters was carried out using Bayesian evolutionary analysis sampling trees (BEAST) (7). The most appropriate demographic model for each cluster was selected using the likelihood ratio test. A constant molecular clock was applied using mutation rates calculated previously for the same gene region of HIV-1 (18, 21). Demographic and evolutionary parameters of the epidemic, together with their confidence intervals, were estimated by Bayesian Markov chain Monte Carlo inference using a chain of 10 million states sampled every 100th generation. The estimated parameters included the date of the most recent common ancestor of the cluster and the likelihood values of the sampled trees.
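As a rough illustration of how sampled parameter values from such a chain might be summarized, the sketch below computes the number of retained samples for a chain of 10 million states sampled every 100th state, and a highest posterior density (HPD) interval for a list of sampled values. This is a generic summary and an assumption on our part, not the output format of BEAST; whether the initial state is also logged depends on the software, and burn-in removal is not shown.

```python
import math
from typing import Sequence, Tuple

CHAIN_LENGTH = 10_000_000   # states in the chain (from the text)
SAMPLE_EVERY = 100          # sampling interval (from the text)
n_retained = CHAIN_LENGTH // SAMPLE_EVERY


def hpd_interval(samples: Sequence[float],
                 mass: float = 0.95) -> Tuple[float, float]:
    """Shortest interval containing at least `mass` of the samples."""
    xs = sorted(samples)
    n = len(xs)
    # Small tolerance guards against floating-point error in mass * n.
    k = max(1, math.ceil(mass * n - 1e-9))
    start = min(range(n - k + 1), key=lambda i: xs[i + k - 1] - xs[i])
    return xs[start], xs[start + k - 1]


# Illustrative values only, not estimates from the study.
toy_interval = hpd_interval([1, 2, 3, 4, 100], mass=0.8)
```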
FIG. 1. Strain-level classification of HIV-1 genetic diversity. Thirty-six strain-level groupings were defined by phylogenetic analysis of sequences from the complete HIV-1 genome and were recoverable using subgenomic fragments. In this example, phylograms illustrate that due to the presence of recombinant strains, the internal topologies of phylogenies (shown as dotted lines) differ depending on which subgenomic region is analyzed. Thus, the CRF03 strain illustrated, highlighted in trees by shaded circles, can be seen to group with subtype A in trees constructed using gag (A) and with subtype B in trees constructed using pol (B). However, the classification that we describe here focuses on closely related sequences sharing specific geographic associations and does not seek to represent deeper evolutionary relationships. As such, it is robust to the effects of recombination; CRF03 and all other CRFs included in the initial data set were recovered as robustly supported monophyletic lineages in trees constructed using subgenomic regions.
TABLE 1. Strain-level assignment of 5,675 HIV-1 pol sequences from the United Kingdom and distribution among nationalities
HIV-1 strains may be introduced into a region either by migration of infected individuals from areas where the strain is prevalent or by infections acquired by locals while traveling in those areas (19, 26, 37). In general, the demographic profiles of infected individuals matched the geographic associations of the HIV-1 strain with which they were infected (Table 1). Thus, the majority of subtype B infections were found in British patients, and other, more recently introduced strains were most commonly found in migrants from countries or regions associated with those strains. The most notable exception to this pattern was CRF01, which is prevalent in Southeast Asia (especially Thailand, Cambodia, and Vietnam [14]); almost half of the individuals infected with this strain for whom data were available were white heterosexuals from the United Kingdom. An elevated prevalence (>10%) within British nationals was also observed for the East African strain of subtype A, the southern/East African strain of subtype C, and CRF02 (Table 1). Among divergent, unclassifiable viruses for which patient demographic data were available, 67% were identified in individuals originally from sub-Saharan Africa.
A phylogenetic approach was used to investigate whether recently introduced strains might be spreading through ongoing transmission within the United Kingdom. In phylogenies representing both sequences sampled within the United Kingdom and sequences sampled globally, sequences belonging to strains other than subtype B and sampled in the United Kingdom were generally interspersed among sequences sampled elsewhere (data not shown), suggesting a random process typical of separate importation events occurring either through migration or through infection of British nationals while traveling abroad. However, three clusters were identified for which statistical analysis suggested potential ongoing transmission within the United Kingdom (Fig. 2). All three clusters comprised a mixture of resistant and nonresistant sequences, and there was no obvious tendency for sequences to group together according to the resistance mutations that they contained. Within each cluster, the only surveillance drug resistance mutation observed in more than one sequence was K103N (observed in two subtype G sequences), and clusters could be robustly recovered if this position was excluded from phylogenetic analysis.
FIG. 2. Clusters of epidemiologically linked HIV-1 infections in the United Kingdom involving recently introduced strains. Epidemiologically linked infections were identified by statistically significant (P < 0.01) clustering of sequences sampled in the United Kingdom (UK) relative to globally sampled sequences in phylogenies representative of characterized diversity among the HIV-1 strains involved (East African subtype A [97 sequences] and West African/Iberian subtype G [56 sequences]). These clusters are shown in the phylogeny above with a representative set of globally sampled sequences. Bootstrap support for clusters was >70%. Filled circles indicate sequences from the United Kingdom. Globally sampled reference sequences are labeled according to subtype and country of sampling. The estimated date of the most recent common ancestor (MRCA) is shown for each cluster.
The third cluster comprised 12 subtype G sequences obtained from individuals of either Portuguese or Angolan origin, half of whom were intravenous drug users. Although local transmission cannot be ruled out, the demographic data in this case suggested that clustering could reflect biased import, possibly via a network of intravenous drug users in the United Kingdom sharing a connection with Portugal.
Coalescent analysis indicated that all three clusters were established approximately 12 to 20 years ago (Fig. 2). This compares with estimates of 30 years for the established subtype B lineages in the United Kingdom (18).
In this analysis, we develop a pragmatic approach to the classification of HIV-1 genetic diversity that is tailored to the purposes of epidemiological surveillance. Annotated complete-genome sequences were used to derive groupings focused on representing shared geographic associations among closely related strains, rather than attempting to definitively represent evolutionary relationships. This classification is unique in that it is robust to the effects of recombination. Although it was essential that complete genomes were used for the initial characterization of global diversity (as only complete or nearly complete sequences can definitively discriminate monophyletic lineages in a population that is continuously being intermixed by recombination), reconstruction of phylogenies using subgenomic regions demonstrated that all groups defined by complete-genome analysis could be recovered using only these regions and the principles that we apply (Fig. 1). Thus, subgenomic fragments could be assigned to recombinant strains. Of course, when assigning subgenomic sequences to strains in this way, we could not rule out the possibility that certain of the sequences were misclassified due to uncharacterized recombination in the parts of the genome that were not analyzed. However, this would not invalidate the epidemiological information that we aim to infer, as robust genetic relatedness between strains, even in a subgenomic fragment, implies an epidemiological link. Furthermore, providing that prevalent recombinant viruses continue to be reported and fully sequenced with regularity, such cases will be relatively rare and unlikely to qualitatively affect our conclusions.
Fine-scale classification of 5,675 pol gene sequences using strain-level groupings revealed the underlying complexity of the HIV-1 epidemic in the United Kingdom. For example, for subtypes A, C, and G, the epidemic within the United Kingdom reflects the introduction of infections from two or more geographically distinct populations (Table 1). Reference to patient data confirmed that the geographic associations inferred through strain-level classification were generally concordant with the demographic profiles of infected individuals, which reinforces the epidemiological identities of the strains that we define.
A phylogenetic exploration of transmission dynamics indicated that the majority of non-B infections in the United Kingdom reflect separate introductions through travel and migration. However, the power of a molecular phylogenetic approach to detect epidemiological shifts of potential significance was illustrated by the identification of two transmission chains involving subtype A strains that are usually associated with heterosexual infections acquired in East Africa (Fig. 2). The observation that >90% of the sequences in these clusters were obtained from patients whose exposure category was defined as sex between men indicates that subtype A, or a novel recombinant epidemiologically linked to it, is spreading within the United Kingdom via the route of men having sex with men. The estimated origin of these two clusters in the late 1980s and early 1990s is concordant with existing epidemiological data, as this period was when the African epidemic was growing at its fastest rate and was prior to the widespread rollout of highly active antiretroviral therapy in the United Kingdom.
HIV-1 pol gene sequence data obtained during routine genotypic resistance testing are increasingly abundant in many countries throughout the world. This report illustrates how such opportunistically sampled data can be employed to monitor changes in the molecular epidemiology of national HIV epidemics. Since many sequences are obtained from treated patients (about half of the sequences in our data set), and some level of transmitted drug resistance is likely to be present within the untreated population, these data sets inevitably contain some level of homoplasy (i.e., convergent/parallel evolution) introduced by drug selection. However, previous studies have demonstrated that such data nevertheless contain sufficient phylogenetic signal for subtype assignment and for reconstruction of transmission history (16, 43), and this was reconfirmed here. For the latter purpose, we emphasize the need for careful analysis with respect to shared resistance mutations (17).
As the classification that we developed in this report is dependent on annotated full-genome sequences, the 36 geographic strains that we identify likely represent an under-sampling of HIV-1 diversity. However, given current trends toward more-efficient high-throughput sequencing technologies, we anticipate that more-representative sets of HIV-1 genome sequences will become available in the future. This will allow a richer characterization of the epidemiological associations among strains and enable further detailed characterization of epidemics to be carried out using the framework implemented in this report.
The UK HIV Drug Resistance Database is partially funded by the Department of Health.
The views expressed here are those of the authors and not necessarily those of the Department of Health.
We thank all the clinicians, virologists, data managers, and research nurses in participating centers who assisted with the provision of data.
Published ahead of print on 26 September 2007.
These authors contributed equally.