Journal of Virology, June 2007, p. 6731-6741, Vol. 81, No. 12
0022-538X/07/$08.00+0 doi:10.1128/JVI.02752-06
Copyright © 2007, American Society for Microbiology. All Rights Reserved.
Bruce Crise,2,
Yuan Li,3
Gerald Princler,1
Nicole Lum,4
Claudia Stewart,4
Connor F. McGrath,5
Stephen H. Hughes,1
David J. Munroe,4 and
Xiaolin Wu4*
HIV Drug Resistance Program, NCI-Frederick, Frederick, Maryland 21702,1 Gene Expression Laboratory,2 AIDS Vaccine Program,3 Laboratory of Molecular Technology,4 Target Structure-Based Drug Discovery Group, SAIC-Frederick, Inc., NCI-Frederick, Frederick, Maryland 217025
Received 14 December 2006/ Accepted 29 March 2007
Cellular cofactors may play important roles in retroviral integration site selection (8, 56). Lens epithelium-derived growth factor (LEDGF/p75) has been shown to bind to HIV integrase (11, 35-37, 53) and to contribute to HIV's preference for integrating into genes (12). Lewinski et al. recently showed that integrase is the principal viral determinant in target site selection (34). In that study, a chimeric HIV carrying an MLV integrase integrated with a target site specificity similar to that of MLV. This suggests that retroviruses with similar integrases should have similar target site preferences.
Human T-cell leukemia virus type 1 (HTLV-1), a member of the deltaretrovirus genus, is the causative agent of adult T-cell leukemia and HTLV-1 associated myelopathy/tropical spastic paraparesis (39, 54). HTLV-1 differs from the retroviruses described above, and it provides an opportunity to test the relationship between integrase phylogeny and integration site selection. Although the viral Tax protein is clearly involved in oncogenic transformation, it is still unclear whether HTLV-1 integration sites influence the expression of cellular or viral genes that relate to the development of disease. There have been several studies of HTLV-1 integration sites in the human genome (19, 24, 31, 32, 44), but the number of sites examined in the majority of these studies was small and the integration sites were cloned from chronically infected patients. In one study, Doi et al. characterized 56 HTLV-1 integration sites from carrier cells and 59 sites from leukemia cells (19) and found that in carrier cells, HTLV-1 integration tended to occur in heterochromatin alphoid repeated regions, whereas in leukemia cells, HTLV-1 integration favored actively transcribed genes. This difference may arise from the different selection pressures in carrier versus leukemic cells after virus integration.
Here we examine HTLV-1 integration sites in HeLa cells infected with HTLV-1 vectors that express a reporter gene but no viral proteins. A total of 541 HTLV-1 integration sites were cloned and sequenced from acutely infected HeLa cells, analyzed, and compared with the integration sites for five other retroviruses (ASLV, FV, MLV, SIV, and HIV) in relation to currently available genomic features. Our results show that HTLV-1 integrates into the human genome with little preference for most of the genomic features analyzed, which is similar to the case with ASLV. The integration preferences for the six retroviruses can be separated into three distinct groups based on cluster analysis of integration site preferences. In both the cluster analysis of integration site preference and phylogenetic analysis of integrase proteins, SIV was most similar to HIV and formed one group, FV was most similar to MLV and formed a second group, and HTLV-1 was most similar to ASLV, forming a third group.
Analysis of HTLV-1 and other viral integration sites in the human genome. Raw sequences were filtered to select those that had the expected LTR sequence and linker sequences. Sequences were trimmed and aligned to human genome hg18 (University of California, Santa Cruz [UCSC] March 2006 freeze; NCBI build 36.1) using the Blat program (http://genome.ucsc.edu). To be considered an authentic integration site, a clone had to meet several criteria: (i) the match to the genome must have >95% identity; (ii) the match must start immediately after the LTR sequence (<5 bp); (iii) the match to the genome must be contiguous, with no large gaps; and (iv) if a clone matches multiple genomic sites, the best match is chosen only if its Blat score is at least 10 higher than that of the second-best match. With these criteria, we mapped 541 unique HTLV-1 integration sites in the human genome from HeLa cells. Other data sets for HIV, MLV, FV, ASLV, and SIV integration sites were downloaded from GenBank and mapped to the human genome using the same automated program, except that a cutoff value of 90% identity was used for SIV integration sites cloned from macaque (25). Customized Perl programs were used to compare localized integration sites to various genomic features. A set of 10,000 random integration sites in the human genome was generated in silico and analyzed together with the viral integration sites. All genomic feature tables and chromosome sequences for human genome hg18 were downloaded from the UCSC genome database (http://genome.ucsc.edu/). Multiple data sets for each virus were first analyzed separately. We did not observe any statistical difference between subsets, and the data sets for each virus were pooled.
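The four acceptance criteria above can be sketched as a small filter. This is a hypothetical illustration in Python, not the customized Perl programs used in the study: the `BlatHit` record, its field names, and the 10-bp cutoff used for the gap criterion (iii) are assumptions introduced here, because the text does not give a numerical gap threshold.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class BlatHit:
    """One genomic match for a clone (hypothetical, simplified record)."""
    chrom: str
    position: int      # genomic coordinate where the match begins
    identity: float    # percent identity of the match (0-100)
    ltr_offset: int    # bases between the end of the LTR and the match start
    max_gap: int       # largest gap inside the alignment, in bp
    score: int         # Blat score


def pick_authentic_site(hits: List[BlatHit],
                        min_identity: float = 95.0,
                        max_ltr_offset: int = 5,
                        max_gap: int = 10,       # assumed cutoff, not from the paper
                        min_score_margin: int = 10) -> Optional[BlatHit]:
    """Return the best hit if it passes criteria (i)-(iv), otherwise None."""
    if not hits:
        return None
    ranked = sorted(hits, key=lambda h: h.score, reverse=True)
    best = ranked[0]
    if best.identity <= min_identity:        # (i) >95% identity
        return None
    if best.ltr_offset >= max_ltr_offset:    # (ii) starts <5 bp after the LTR
        return None
    if best.max_gap > max_gap:               # (iii) contiguous match
        return None
    # (iv) an ambiguous clone is kept only if the best score leads by >= 10
    if len(ranked) > 1 and best.score - ranked[1].score < min_score_margin:
        return None
    return best
```

A clone with a single high-identity hit is accepted, while a clone whose two best hits differ by fewer than 10 score units is discarded as ambiguous.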
Cluster analysis of viral integration site profiles and phylogenetic analysis of viral integrase homology. BRB-arrayTools 3.3.0 software (http://linus.nci.nih.gov/BRB-ArrayTools.html) was used to cluster viral integration site profiles. Integration sites for all six retroviruses and random sites were analyzed using a total of 69 genomic features, including genes, CpG islands, GC content, etc. Unsupervised hierarchical clustering was performed using 69 genomic features with Euclidean distance and average linkage.
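For readers unfamiliar with the method, the following is a minimal Python sketch of agglomerative clustering with Euclidean distance and average linkage. It is not the BRB-ArrayTools implementation; the function names and the toy two-feature profiles in the usage note are invented for illustration and are not data from this study.

```python
import math
from itertools import combinations


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def average_linkage(profiles):
    """Agglomerative clustering of named profiles.

    profiles: dict mapping a sample name to its feature vector.
    Returns the merges in order as (members_a, members_b, distance).
    """
    clusters = [(name,) for name in profiles]
    merges = []

    def dist(c1, c2):
        # Average linkage: mean of all pairwise distances between members.
        total = sum(euclidean(profiles[a], profiles[b]) for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        # Find the closest pair of current clusters.
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda p: dist(clusters[p[0]], clusters[p[1]]))
        merges.append((clusters[i], clusters[j], dist(clusters[i], clusters[j])))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges
```

With toy profiles such as `{"v1": [0.0, 0.0], "v2": [0.0, 1.0], "v3": [10.0, 0.0], "v4": [10.0, 1.0]}`, the two nearby pairs merge first and the two resulting clusters are joined last, which is the structure a dendrogram would display.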
For phylogenetic analysis, amino acid sequences of viral integrase for all six retroviruses were aligned with the AlignX program based on the Clustal W algorithm in the VectorNTI software suite (Invitrogen). The SwissProt accession numbers are as follows: P14078 (HTLV-1), Q7SQ98 (ASLV), P23074 (FV), P03355 (MLV), P05896 (SIV), and P03366 (HIV). Reverse transcriptase and RNase H sequences were trimmed off. Only the integrase sequences of the POL proteins were used for alignment. An unrooted neighbor-joining tree was generated with Mega3.1 software with 10,000 bootstrap samples (29). A phylogenetic tree was also generated by the GeneBee TreeTop phylogenetic tree prediction server based on a cluster algorithm (http://www.genebee.msu.su/services/phtree_reduced.html) (5). The two trees were very similar.
TABLE 1. Integration site data sets used in this study
FIG. 1. Palindromic consensus sequences at retroviral integration sites. Base compositions around the integration sites were calculated. Integration occurs between positions –1 and 1 on the top strand. Colored positions have frequencies of bases statistically different from those of randomly generated positions (P < 0.01), which are 30%/20%/20%/30% for A/C/G/T. Bases with a greater than 10% increase of frequency at a position are colored green, and bases with a greater than 10% decrease of frequency at a position are colored red. Inferred duplicated target sites are in the blue box, and DNA strand transfer occurs at positions labeled by arrows. The base preferences show palindromic patterns centered on the duplicated target sites, and the symmetries are marked with the dotted vertical line.
TABLE 2. Integration frequency near genomic features
FIG. 2. Integration frequencies of HTLV-1 and five other retroviruses near various genomic features. (A) Integration frequency near transcription start sites of Refseq genes. The frequency is shown as the percentage of integration sites adjusted to the density as numbers per kb for each interval near the transcription start sites of Refseq genes. –, region upstream of transcription start sites; +, region downstream of transcription start sites. (B) Integration frequency near CpG islands. The percentage of the integration sites (per kb) is shown for each interval near CpG islands. –, region upstream of CpG islands; +, region downstream of CpG islands. (C) Integration site distribution within Refseq genes. Each gene was conceptually divided into eight equal-size bins from the start to the end of the gene. Integration sites in each bin were added together and plotted for all viruses. (D) Integration frequencies near DNase hypersensitive sites. Integration sites within ±1 kb of DNase I cleavage sites were compared to random expected values. Frequency is represented as the ratio of observed sites/expected sites. Frequency near random sites is represented by the dotted line. (E) GC content near integration sites. The GC content within windows of various sizes near integration sites was computed for each virus and compared to that for random integration sites. The GC content near random sites reflects the average GC content of the human genome, which is close to 41%. The GC contents around MLV and FV integration sites are higher than those for random sites, while the GC contents around SIV and HIV integration sites are lower than those for random sites. (F) Gene density around integration sites. The numbers of Refseq genes found within ±1 Mb of integration sites were averaged for each virus and compared to those for 10,000 random sites. The dotted line represents gene density around random sites.
Integration near CpG islands. CpG islands are thought to be associated with transcriptional start sites in vertebrate genomes (3, 30). We analyzed integration sites of all six retroviruses relative to the random data set for proximity to CpG islands in the human genome (Table 2; Fig. 2B). Again, MLV showed the strongest preference for integration into regions near CpG islands, with 21.5% of integration sites within a ±2-kb window of CpG islands (P < 0.0001). FV showed the second-strongest preference near CpG islands, with 15.2% of integration sites within the same window (P < 0.0001). ASLV showed a slight preference for regions around CpG islands, with 7.6% (P = 0.0001) of its integration sites within the window. HTLV and HIV showed no significant preference compared with random sites (5.9%, 3.6%, and 4.3% for HTLV, HIV, and random sites, respectively). The frequency of SIV integrations near CpG islands (1.3%) was lower than that for random sites, although the difference was not statistically significant. For each of the viruses, the integration frequency near CpG islands is in good agreement with the frequency near transcription start sites.
In addition, we used the FirstEF (First Exon Finder) table from the UCSC genome database to estimate the integration frequency near transcription start sites or promoter regions. FirstEF is a program that predicts promoters and 5'-terminal exons. The FirstEF database contains three types of predictions for the human genome: first exon, promoter, and CpG window. The integration frequencies relative to these three features were similar to the data from the RefSeq transcription start sites and the CpG islands for each of the viruses. For MLV, 23.7% of the integration sites were within the ±2-kb window of predicted promoters (P < 0.0001). FV integrated in the same regions at a frequency of 16.4% (P < 0.0001). HTLV and ASLV showed a weak preference for promoter regions (6.8% [P = 0.03] and 8.3% [P = 0.0001], respectively). HIV and SIV showed no preference or a slight avoidance for these regions compared to random sites (HIV, 3.6%; SIV, 1.3%; random, 4.8%).
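The window-based frequencies reported for CpG islands and predicted promoters amount to counting the integration sites that lie within ±2 kb of any annotated feature. The sketch below illustrates that calculation in Python under a simplifying assumption: each feature is reduced to a single coordinate, whereas real CpG islands and promoters are intervals. The function name and the data layout are hypothetical and are not taken from the study's Perl programs.

```python
from bisect import bisect_left


def fraction_near_features(sites, features, window=2000):
    """Fraction of sites within +/- window bp of at least one feature.

    sites:    list of (chrom, position) tuples.
    features: dict mapping chrom to a sorted list of feature coordinates.
    """
    if not sites:
        return 0.0
    near = 0
    for chrom, pos in sites:
        positions = features.get(chrom, [])
        # First feature at or to the right of the window's left edge.
        i = bisect_left(positions, pos - window)
        if i < len(positions) and positions[i] <= pos + window:
            near += 1
    return near / len(sites)
```

Running the same function on the in silico random sites gives the baseline against which each virus is compared.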
Integration in genes. HIV and SIV were reported to preferentially integrate into genes or transcription units (17, 25, 49, 57). The frequency of HTLV integration into genes or transcription units was compared to those for five other viruses and a random data set (Table 2). Several human gene annotation tables from the UCSC genome database were used for this analysis, including RefSeq genes, Known genes, Ensembl genes, MGC genes, SGP genes, and Genscan genes. Regardless of which database was used, a consistent pattern was seen for each virus, except in the case of Genscan genes, which are entirely computational predictions. Here we focus on RefSeq genes, because they are well annotated. Our analysis of ASLV, FV, MLV, SIV, and HIV agrees with the published reports. The SIV and HIV proviruses were preferentially integrated into genes, with 80% and 72% of integration sites, respectively, within RefSeq genes (P < 0.0001). HTLV, like ASLV and MLV, showed a modest preference for genes, with 46.8%, 46.4%, and 45.7% of integration sites, respectively, in RefSeq genes (P < 0.0001). FV showed no preference for genes; only 32.7% of FV integrations were within RefSeq genes, which is lower than the 35.7% observed for the random data set, suggesting that FV may avoid genes as targets (P = 0.002).
We also looked at the distribution of integration sites within RefSeq genes (Fig. 2C). All genes are divided into eight bins, starting from the transcription start site. Integration sites inside genes were placed in those eight bins according to location. The percentage of integration sites was then calculated for each bin. For MLV and FV, the first bin has the highest integration frequency (P < 0.05), reflecting their preference for transcription start sites. For SIV and HIV, the frequency tends to be higher in the middle of genes (second to seventh bins) and lower at both ends of genes (first and eighth bins). HTLV and ASLV showed a roughly even distribution across all eight bins.
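The binning step can be illustrated with a short Python sketch. It assumes UCSC-style gene coordinates (0-based, half-open, with the start coordinate smaller than the end coordinate regardless of strand) and counts bins from the transcription start site, so that for a minus-strand gene the first bin lies at the high-coordinate end. The function name and signature are hypothetical.

```python
def gene_bin(site, tx_start, tx_end, strand, n_bins=8):
    """Return the 0-based bin of a site within a gene, counted from the
    transcription start site, or None if the site lies outside the gene.

    Coordinates are assumed 0-based and half-open: tx_start <= site < tx_end.
    """
    if not (tx_start <= site < tx_end):
        return None
    length = tx_end - tx_start
    # Distance from the transcription start site along the direction of transcription.
    if strand == "+":
        offset = site - tx_start
    else:
        offset = tx_end - 1 - site
    return min(offset * n_bins // length, n_bins - 1)
```

Tallying the returned bin index over all intragenic sites of a virus and dividing by the number of such sites gives the per-bin percentages described above.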
Integration near DNase-hypersensitive sites. DNase-hypersensitive sites are believed to be nucleosome-free regions of the chromatin associated with regulatory elements, such as promoters, silencers, enhancers, and locus control regions in the genome (21). Recently Crawford et al. mapped a large number of DNase I-hypersensitive sites in the human genome (15, 16). DNase-hypersensitive sites were enriched upstream of genes, in CpG islands, and in regions that are conserved in multiple species. Most of the DNase-hypersensitive sites were not cell line specific. Figure 2D shows the integration preferences of all six retroviruses within a ±1-kb window of all DNase-hypersensitive sites with a score of 750 (this score correlates with approximately 85% of the valid DNase-hypersensitive sites; NHGRI DNase I-hypersensitive sites track description, http://genome.ucsc.edu/). Among the six retroviruses, MLV showed the strongest preference for integrating near DNase-hypersensitive sites (P < 0.0001), while FV showed a weaker yet still significant preference for DNase-hypersensitive sites (P < 0.0001). HTLV, ASLV, SIV, and HIV showed no significant preference for DNase-hypersensitive sites compared to random sites.
GC content near integration sites. Genomic sequences around integration sites were aligned, and GC content in variously sized windows (50 bp, 100 bp, 200 bp, 500 bp, and 1,000 bp) was computed. Table 2 and Fig. 2E show the average GC content in these windows around the integration sites of all six retroviruses. MLV and FV integration sites have a higher GC content than the random sites in window sizes up to 1 kb (P < 0.0001, Monte Carlo simulation, compared to 100,000 x n sets of random sites, where n is the matched number of integration sites used for each virus). These results may reflect the preferences for CpG islands by MLV and FV. SIV and HIV both have lower GC content surrounding integration sites than for random sites (P < 0.0001). The GC content surrounding HTLV-1 and ASLV sites was similar to that for random sites.
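As an illustration of the windowed GC calculation, the Python sketch below computes the GC fraction in a window centered on an integration site. The function name, the centering convention, and the handling of ambiguous bases (N is ignored) are assumptions made here; the text does not specify them.

```python
def gc_content(sequence, center, window):
    """GC fraction in a window of the given size centered on `center`.

    sequence: chromosome (or contig) sequence as a string.
    center:   0-based index of the integration site in `sequence`.
    window:   window size in bp (e.g., 50, 100, 200, 500, or 1000).
    Returns None if the window contains no unambiguous bases.
    """
    half = window // 2
    start = max(0, center - half)
    end = min(len(sequence), center + half)
    region = sequence[start:end].upper()
    counted = sum(region.count(base) for base in "ACGT")  # ignore N and others
    if counted == 0:
        return None
    return (region.count("G") + region.count("C")) / counted
```

Averaging this value over all sites of a virus, for each window size, yields one curve per virus of the kind plotted against the random-site baseline.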
Integration and gene density. The gene densities surrounding the integration sites of all six retroviruses were also calculated. The average number of genes found within 1 Mb of the integration sites (Table 2) for each virus was plotted and is shown in Fig. 2F. All viruses showed an elevated average gene density within a 1-Mb window of the integration sites (P < 0.0001, compared to 10,000 random sites with a t test). The highest gene density was found around SIV integration sites. HIV integration sites had the second-highest gene density. MLV integration sites had the third-highest gene density. Gene densities around HTLV-1, ASLV, and FV integration sites were similar.
Global comparison of integration target site preferences of six retroviruses. From the above analysis of integration sites of six retroviruses, it appeared that HTLV-1 and ASLV integration sites were similar with respect to the integration preferences for genomic features such as transcription start sites, CpG islands, promoters, DNase-hypersensitive sites, genes, gene density, and GC content. The same was true of FV and MLV integration sites and of SIV and HIV integration sites. Clustering methods have been commonly used to measure the similarities and differences within and between groups of samples. A machine learning algorithm was recently used by Lewinski et al. to describe the similarity of global integration profiles of HIV, MLV, and HIV/MLV hybrid viruses (34). We performed cluster analysis of the global integration profiles of six retroviruses and the random-site control. This was done by taking into account 69 different genomic features, some of which have been described above (see Table S1 in the supplemental material). Using unsupervised hierarchical clustering with Euclidean distance and average linkage, the six viruses and the random sites could be clearly separated into three distinct clusters (Fig. 3). SIV and HIV form one cluster, FV and MLV form a second cluster, and HTLV-1 and ASLV form a third cluster together with the random sites.
FIG. 3. Clustering of integration site preferences and phylogenetic analysis of integrases of all six retroviruses. (A) Heat map of clustering of the integration sites for all six retroviruses and random sites based on 69 genomic features. (B) Dendrogram based on location of integration in relation to 69 genomic features. Unsupervised hierarchical clustering with Euclidean distance and average linkage was used to generate the dendrogram. (C) Phylogenetic tree based on amino acid sequences of the six retroviral integrases. Bootstrap values for the neighbor-joining method (percentages from 10,000 trials) are shown on each branch. Integrase sequences are from the following POL proteins: P14078 (HTLV-1), Q7SQ98 (ASLV), P23074 (FV), P03355 (MLV), P05896 (SIV), and P03366 (HIV). (D) Phylogenetic tree with additional integrase sequences, including P31822 (FIV), P03365 (MMTV), and AAA35339 (Tf1) integrases, showing that the three additional integrases are placed in different clusters.
We observed two hot spots (chr11p11.2 and chr11q12.1) for HTLV-1 integration in HeLa cells. The first, on chr11p11.2, is a 162-kb region that had 6 independent integration sites (P = 0.00001 based on 100,000 x 541 Monte Carlo simulations). This region contains the 5' end of the gene encoding receptor protein tyrosine phosphatase J. Tyrosine phosphatase J is present in all hematopoietic lineages and was shown to negatively regulate T-cell receptor signaling (1). The second hot spot is a 100-kb region on chr11q12.1 that had 5 independent integration sites (P = 0.0004, based on 100,000 x 541 Monte Carlo simulations). There are two genes within this region: RTN4RL2 and SLC43A1. We do not know the biological relevance of the hot spots or whether the hot spots were related to the drug selection. Earlier work with HIV also found an integration hot spot in SupT1 cells, but this hot spot did not appear in other cell types studied (49).
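The text does not describe how the Monte Carlo P values were computed, so the following Python sketch shows only one plausible reading: draw the same number of random sites repeatedly and ask how often any window of the given size contains at least as many sites as observed. The function names are hypothetical, the genome is treated as a single linear coordinate (chromosome boundaries are ignored), and the default number of simulations is far smaller than the 100,000 used in the study.

```python
import random


def max_sites_in_window(positions, window):
    """Largest number of sites whose coordinates span at most `window` bp."""
    positions = sorted(positions)
    best, left = 0, 0
    for right, pos in enumerate(positions):
        # Shrink the window from the left until it spans <= window bp.
        while pos - positions[left] > window:
            left += 1
        best = max(best, right - left + 1)
    return best


def hotspot_p_value(observed, n_sites, window, genome_length,
                    n_sim=1000, seed=0):
    """Fraction of simulated data sets with a cluster at least as dense as observed.

    Assumes uniformly random integration along one linear genome.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sim):
        positions = [rng.randrange(genome_length) for _ in range(n_sites)]
        if max_sites_in_window(positions, window) >= observed:
            hits += 1
    return hits / n_sim
```

Under this reading, a call such as `hotspot_p_value(6, 541, 162_000, genome_length)` would estimate how often 541 random sites produce six or more integrations within any 162-kb span; the value of `genome_length` and the number of simulations are left to the user.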
Our results show that HTLV-1 integration is nearly random within the HeLa cell genome. The six retroviruses compared here can be placed into three groups, based on the preferences of their integration sites for different genomic features. The groups are characterized by integration sites that are predominantly as follows: (i) near transcription start sites and CpG islands (MLV and FV); (ii) within genes or transcription units (SIV and HIV); or (iii) randomly dispersed (HTLV and ASLV). The same three pairs of retroviruses were clustered together in phylogenetic analyses of their integrase proteins, even though viruses in two of these pairs were from different retroviral genera. These results suggest that the most closely related integrase proteins direct integration into regions of the genome with similar features and that viruses in these different groups use distinct mechanisms to access their integration sites.
It should be possible to predict the global integration profiles of uncharacterized retroviruses based on integrase phylogenies. For example, feline immunodeficiency virus (FIV), which is being used to develop gene therapy vectors (45), generates a 5-bp target site duplication. Phylogenetic analysis puts FIV integrase in the same cluster with SIV and HIV, predicting that the integration profile will be similar to that of SIV or HIV. The recent report by Kang et al. on FIV vector integration sites is consistent with this prediction (27). The relationship of integrase phylogeny and integration preference can be extended to certain retrotransposons. The Tf1 transposon from Schizosaccharomyces pombe is an LTR retrotransposon closely related to retroviruses (33). Phylogenetic analysis of Tf1 integrase places it in the MLV/FV cluster (Fig. 3). It has been shown that the Tf1 integration site preference resembles that of MLV and FV in that Tf1 prefers to integrate in the promoter regions of polymerase II-transcribed genes (4, 50). The integration profile of mouse mammary tumor virus (MMTV), a betaretrovirus that generates a 6-bp target site duplication, has not been determined. Phylogenetic analysis of MMTV integrase (Fig. 3) places it in the HTLV and ASLV cluster, leading to the prediction that MMTV will integrate into the host genome with little preference for any of the genomic features we have analyzed.
Both cellular and viral factors may contribute to the integration sites selected by retroviruses (6, 7, 56). Cellular factors can cooperate in the targeting of preintegration complexes to specific genomic features (8). LEDGF/p75 binds to HIV integrase (11, 35, 37, 53), increases the efficiency of HIV integration (36), and plays a role in targeting integration into genes (12). In contrast, MLV integrase does not interact with LEDGF/p75 but is likely to target promoter regions by interacting with different cellular factors. The absence of integration site specificity for HTLV-1 and ASLV could be due to interactions with ubiquitous chromosomal proteins or to a lack of interaction with host proteins. Alternatively, we cannot rule out the possibility that the cellular protein or protein isoform that interacts with HTLV-1 or ASLV integrase is not expressed in HeLa cells. Further studies of integration profiles for these viruses in other cell types will be needed to resolve these issues.
Although the interaction between retroviral integrases and the host factors involves the three-dimensional structure of the proteins, as illustrated by lentiviral integrase and LEDGF/p75 (10), alignment of primary sequences of related proteins often reveals important motifs. To identify potential interaction motifs that are shared among integrase proteins, integrase sequences from retroviruses with similar integration site preferences were aligned. Apparent conserved regions were observed (Fig. 4). For instance, alignment of MLV, FV, and other closely related integrases revealed conserved motifs in addition to the HHCC zinc finger motif and the DDE catalytic motif (Fig. 4A). The LTKL motif is probably within the α4 helix of the catalytic domain, based on the comparison of domain structures of MLV and HIV IN (48). Further toward the C terminus, another conserved region can be defined as GxxVxxRxxxxxxLxP(R/K)WxxPxx(V/I)L, where x is any amino acid. This domain was also identified as a conserved domain (the GPY/F domain) in the Ty3/Gypsy class of LTR retrotransposons and some retroviral integrases (38), although the element we have identified varies slightly from the reported GPY/F module. This motif in the Ty3/Gypsy class of retrotransposons was proposed to play a role in directing integration specificity (38). This domain is also present in the Schizosaccharomyces pombe Tf1 element, which has an integration site preference similar to those of MLV and FV, targeting upstream regions of polymerase II-transcribed genes (4, 50). This domain is not found in other retroviral integrases analyzed in this study. Conserved motifs were also observed when HTLV-1 and ASLV families were aligned (Fig. 4B). It will be interesting to see if mutations in these regions alter the targeting specificities of the integrases.
FIG. 4. Alignment of retroviral integrases within each cluster reveals conserved motifs outside the catalytic core that may interact with cellular targeting factors. Identical amino acids are labeled with a black background. The zinc finger motif (HHCC) and the catalytic core (DDE motif) are labeled with black arrowheads. Conserved residues of other motifs are labeled with white arrowheads. (A) Alignment of IN from the MLV and FV families revealed additional conserved motifs. The LTKL motif is probably part of the catalytic core, based on a domain structure study comparing HIV and MLV IN (48). Another conserved motif (PxxxGxxVxxRxxxxxxLxP(R/K)xxPxxxL) is found in the C-terminal regions of MLV and FV IN. (B) Alignment of IN from the HTLV and ALV families. The KTxxQxHxxP motif is located on the linker between the catalytic core and the C-terminal domain based on the RSV crystal structure (1C0M). Another conserved motif, WxPW, is found at the ends of the C-terminal regions of HTLV and ALV IN.
The content of this publication does not necessarily reflect the views or policies of the Department of Health and Human Services, nor does mention of trade names, commercial products, or organizations imply endorsement by the U.S. Government.
Published ahead of print on 4 April 2007.
Supplemental material for this article may be found at http://jvi.asm.org/.
These authors contributed equally to this work.