Journal of Virology, November 2006, p. 11313-11321, Vol. 80, No. 22
0022-538X/06/$08.00+0 doi:10.1128/JVI.01737-06
Copyright © 2006, American Society for Microbiology. All Rights Reserved.
Biomedical Engineering Interdepartmental Degree Program,¹ Departments of Human Genetics and Biomathematics,³ Department of Molecular and Medical Pharmacology, Molecular Biology Institute, and UCLA AIDS Institute, UCLA School of Medicine, Los Angeles, California 90095²
Received 10 August 2006/ Accepted 31 August 2006
Because integration of retroviral DNA is inherently a mutagenic event for the host cell, a better understanding of the process of target site selection may also provide a better assessment of cellular toxicity induced by insertional mutagenesis (3, 29). The information may also be used to optimize retrovirus-based vectors in genetic engineering and therapy. For instance, vectors with an integration site preference for intergenic regions may be more attractive for human gene therapy than those favoring integration near transcription start sites or in active genes (12, 30, 31, 42).
The mechanism that determines target site specificity of retroviruses is not well understood and is likely affected by multiple factors. These factors include the virus-encoded enzyme integrase (IN) that catalyzes the integration reaction, target DNA sequence and structure, transcriptional status of DNA, DNA methylation, repetitive elements, and DNA-binding proteins (see references 6, 7, 11, and 19 and references therein). Because of the nonspecific nature of integration, studies on understanding the mechanism and characterizing the factors that control target site selection in infected cells require the collection and analysis of a large library of HIV-1 proviral clones. Although the roles of the aforementioned factors in the selection of DNA sites for integration have been studied to various extents in vitro, characterization of the roles of many of these factors in vivo is either scarce or inadequate or has not been done (6, 19). The conventional means for analyzing integration sites involve sequencing and mapping of one positive clone for each integration site (for examples, see references 17, 26, 31, 37, and 42). As such, integration site analysis is typically time-consuming and labor-intensive, and owing to the relatively small number of integration events that can be analyzed by these methods, associating integration sites with particular genomic regions or individual genes and evaluating the role of any one factor in integration are not trivial tasks.
We have developed and validated a high-throughput, efficient, and unbiased method of sequencing and mapping a large number of independent integration sites. This high-throughput assay should aid the effort in understanding the mechanisms of target site selection and examining the role of viral factors and host cell processes in influencing the choice of target sites during retroviral DNA integration in vivo.
Preparation of WT and MmeI-containing viruses. The mutant virus NL-Mme was derived from the wild-type (WT) HIV-1 molecular clone NL4-3 by the introduction of point mutations at positions 630 (T to C) and 632 (G to A) at the 3' end of the left long terminal repeat (LTR) (Fig. 1A) to create the MmeI recognition site. The primers used for mutagenesis were MmeF (5'-CAGTGTGGAAAATCTCCAACAGTGGC) and MmeR (5'-TGTTCGGGCGCCACTGTTGGAGATTT). pNL4-3 or pNL-Mme DNA was transformed into TOP10 cells (Invitrogen), which were cultured at 30°C, and plasmid DNAs were prepared using an Endofree plasmid maxi kit (QIAGEN). All viral stocks were prepared by PolyFect (QIAGEN) transfection of 1 × 10⁶ 293T cells with 4 µg of DNA in 25-cm² flasks (1). Culture supernatants were collected 48 h after transfection and passed by gravity through a 0.45-µm low-protein-binding membrane (Corning). Virions were treated with 200 U of RNase-free DNase I (Amersham Pharmacia) per ml of viral stock in the presence of 10 mM MgCl₂ at room temperature for 1 h and stored at -80°C until use. The virus titer was estimated by an enzyme-linked immunosorbent assay (Coulter, Inc.) against the HIV-1 p24 antigen.
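The engineered site can be checked directly against the two mutagenic primers listed above. The following is a minimal sketch, assuming that the MmeI recognition sequence is 5'-TCCRAC (R = A or G); the helper names `revcomp` and `mmei_strands` are illustrative and are not part of the original protocol.

```python
import re

# Assumption: MmeI recognizes 5'-TCCRAC-3', where R is A or G.
MMEI_SITE = re.compile(r"TCC[AG]AC")

_COMPLEMENT = str.maketrans("ACGT", "TGCA")

# Mutagenic primers exactly as given in the text (5' to 3').
PRIMERS = {
    "MmeF": "CAGTGTGGAAAATCTCCAACAGTGGC",
    "MmeR": "TGTTCGGGCGCCACTGTTGGAGATTT",
}


def revcomp(seq):
    """Return the reverse complement of an A/C/G/T sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def mmei_strands(seq):
    """List the strands ('+' as written, '-' reverse complement) carrying the site."""
    strands = []
    if MMEI_SITE.search(seq):
        strands.append("+")
    if MMEI_SITE.search(revcomp(seq)):
        strands.append("-")
    return strands


if __name__ == "__main__":
    for name, seq in PRIMERS.items():
        print(name, mmei_strands(seq))
```

Under this assumption, the hexamer TCCAAC is found on the strand of MmeF as written and on the complementary strand of MmeR, as expected for a pair of overlapping mutagenic primers.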
FIG. 1. Assays for genome-wide analysis of HIV-1 integration sites. (A) Construction of mutant HIV with a type IIS MmeI restriction site in the LTR. Bold letters denote viral DNA sequences at the U5 region of the LTR, and italicized letters denote chromosomal DNA. The nucleotides at the positions indicated by the arrows were changed from T and G in the wild-type sequence to C and A, respectively, in the NL-Mme mutant, generating a new recognition site for MmeI (underlined). Arrowheads indicate cleavage sites for MmeI. (B) Schematic diagram outlining the major steps of the conventional assay and the high-throughput Int-tag assay. Viral, cellular, and linker DNAs are denoted by green, red, and black lines or boxes, respectively. Red dotted lines denote cellular DNA with various lengths. Blue ovals represent streptavidin beads, and green diamonds represent biotin. See Materials and Methods for a detailed description of the experimental procedures.
(i) Conventional assay. The streptavidin-bound Int-DNA was digested with NspI (PuCATG↓Py), a 6-bp cutter that produces DNA fragments averaging about 1 kbp in length (Fig. 1B). The digested products were ligated with a short DNA linker (NP linker), which was prepared by annealing BHLinkA (5'-CGGATCCCGCATCATATCTCCAGGTGTG) with NPLink (5'-CACCTGGAGATATGATGCGGGATCCGCATG). The NP linker contains a BamHI site (underlined) and a 4-nt 3'-overhang (in bold type) complementary with the NspI-digested Int-DNA fragments. The ligated products were amplified by PCR using U51 (5'-TGGCTAACTAGGGAACCCACT) and NP1 (5'-TCACACCTGGAGATATGATGCG) as the forward and reverse primers, respectively. U51 anneals to nt positions 9570 to 9590 of the viral U5 end, whereas NP1 anneals to the NP linker. The PCR amplification was carried out in a final volume of 200 µl with 0.5 µM of each primer, 0.2 mM of dNTPs, ~10⁻³ pmol of template DNA, and 10 U Herculase DNA polymerase under the following conditions: 2 min of preincubation at 94°C, followed by 27 cycles at 94°C for 30 s, 58°C for 30 s, and 72°C for 5 min. The reaction mixture was then incubated for 10 min at 72°C for a final extension. The PCR products were separated on a 2% agarose gel. DNAs over 100 bp in length were isolated using a gel extraction kit (QIAGEN) and cloned using a Zero Blunt TOPO PCR kit (Invitrogen).
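The geometry of the NP linker can be verified from the two oligonucleotide sequences given above. The sketch below anneals NPLink to BHLinkA in silico and reports the unpaired 3' tail of NPLink; it assumes that NspI leaves a 4-nt CATG 3'-overhang on the digested fragments, and the function names are illustrative only.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")

# Linker oligonucleotides exactly as given in the text (5' to 3').
BHLINKA = "CGGATCCCGCATCATATCTCCAGGTGTG"
NPLINK = "CACCTGGAGATATGATGCGGGATCCGCATG"


def revcomp(seq):
    """Return the reverse complement of an A/C/G/T sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def overhang_3prime(strand, partner):
    """Return the 3' tail of `strand` left unpaired when annealed to `partner`.

    The paired region is taken as the longest 5' portion of `strand` that
    occurs in the reverse complement of `partner`.
    """
    target = revcomp(partner)
    for tail_len in range(len(strand)):
        paired = strand[: len(strand) - tail_len]
        if paired in target:
            return strand[len(strand) - tail_len:]
    return strand  # no pairing found at all


if __name__ == "__main__":
    tail = overhang_3prime(NPLINK, BHLINKA)
    print("NPLink 3' overhang:", tail)
    # A palindromic overhang pairs with an identical overhang on the fragment.
    print("self-complementary:", tail == revcomp(tail))
    print("BamHI site present:", "GGATCC" in BHLINKA)
```

Because CATG is its own reverse complement, the linker overhang can pair with the overhang assumed to be left by NspI on either end of a digested fragment.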
(ii) High-throughput Int-tag assay. The streptavidin-bound Int-DNA was digested by MmeI and then ligated with a 28-bp DNA linker (BH linker) containing a randomized 2-nt 3'-overhang and a 3-nt 5'-overhang in the minus strand (Fig. 1B). The 2 nt at the 3'-overhang were randomized so that the linker could base pair and ligate with the MmeI-digested Int-DNA fragments. The BH linker was prepared by annealing BHLinkA with BHLinkB (5'-TGTCACACCTGGAGATATGATGCGGGATCCGNN). BHLinkB was synthesized to contain a uniform distribution of the 16 possible combinations for two sequential random nucleotides. The random nature of the 2 nucleotides at the 3' end was confirmed by electrospray ionization mass spectrometry (data not shown). The BH linker contains a BamHI site (underlined) and is not phosphorylated to avoid self-ligation.
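The structure of the BH linker described above can be reproduced from the BHLinkA and BHLinkB sequences. The following sketch enumerates the 16 BHLinkB variants implied by the two N positions and checks that the remainder of BHLinkB is a 3-nt 5' extension followed by the exact reverse complement of BHLinkA; the function names are illustrative and the check is purely in silico.

```python
from itertools import product

_COMPLEMENT = str.maketrans("ACGT", "TGCA")

# Linker oligonucleotides exactly as given in the text (5' to 3').
BHLINKA = "CGGATCCCGCATCATATCTCCAGGTGTG"
BHLINKB = "TGTCACACCTGGAGATATGATGCGGGATCCGNN"


def revcomp(seq):
    """Return the reverse complement of an A/C/G/T sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def expand_n(template):
    """Enumerate every sequence obtained by replacing each N with A, C, G, or T."""
    choices = ["ACGT" if base == "N" else base for base in template]
    return ["".join(combo) for combo in product(*choices)]


def describe_bh_linker():
    """Split BHLinkB into 5' overhang, paired core, and randomized 3' overhang."""
    five_prime, core, three_prime = BHLINKB[:3], BHLINKB[3:-2], BHLINKB[-2:]
    return {
        "duplex_bp": len(core),
        "core_pairs_with_BHLinkA": core == revcomp(BHLINKA),
        "five_prime_overhang": five_prime,
        "three_prime_overhang": three_prime,
        "variants": len(expand_n(BHLINKB)),
    }


if __name__ == "__main__":
    for key, value in describe_bh_linker().items():
        print(key, value)
```

The summary agrees with the description in the text: a 28-bp duplex, a 3-nt 5'-overhang and a 2-nt randomized 3'-overhang on the minus strand, and 16 possible overhang combinations.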
The linker-ligated DNA was amplified by a two-step PCR using the forward primer BMF (5'-TCAGACGGATCCAGTCAGTGTGGAAAATCTCC) and the reverse primer NP1. BMF contains a BamHI site (underlined) and anneals to 4 nt upstream of the viral U5 end. Approximately 10⁻⁴ pmol of template DNA was added to a final volume of 1 ml of 1× PfuUltra buffer containing 1 µM of BMF and NP1 primers, 0.2 mM of dNTPs, and 20 U PfuUltra DNA polymerase. The reaction mixture was aliquoted into a 96-well plate for the amplification. The first-round-PCR condition was 2 min of preincubation at 94°C, followed by 27 cycles at 94°C for 30 s, 58°C for 30 s, and 72°C for 1 min, and then a final extension of 10 min at 72°C. After PCR, the reaction mixture was concentrated by ethanol precipitation and separated on a 14% native polyacrylamide gel. The 79- and 80-bp products were extracted and subjected to 20 additional cycles of linear amplification under the condition of the first-round PCR. The amplified DNA was digested with BamHI to form 46- or 47-bp fragments (termed Int-tags) with a 4-nt 5'-overhang at each end (Fig. 1B). Each Int-tag has 25 bp of viral end DNA, 19 or 20 bp of cellular DNA, and 2 bp of linker sequence. The digestion mixture was concentrated and separated on a 14% native polyacrylamide gel at 4°C and 150 V to prevent denaturation of Int-tags. The Int-tags were excised and extracted from the gel and concatemerized by ligation using 10,000 U of T4 DNA ligase (New England BioLabs) in a total of 1 ml reaction mixture incubated at 16°C for 4 h. The reaction mixture was then concentrated to 20 µl and separated on a 12% native polyacrylamide gel. DNA products longer than 400 bp were isolated. Concatemerized Int-tags were cloned into circularized pCR4Blunt previously cut with BamHI. Insertion of concatemerized Int-tags into the BamHI site within pCR4Blunt would disrupt the toxin-producing ccdB gene (14) and thus provide positive selection for clones containing Int-tag concatemers.
Sequence analysis and mapping integration sites. The sequence of the cloned DNA was determined by dideoxy sequencing, and sequencing ambiguities were resolved by repeated sequencing on both strands. Cellular DNA sequences obtained from each authentic integration site were cataloged using the software program MacVector 7.1.1 (Oxford Molecular). The sequences were aligned and searched for consensus sequence and features by use of AssemblyLIGN, with the threshold parameter set at 40%.
The chromosomal location of the integration site sequence was mapped to the human genome (Human May 2004 [hg17] assembly, National Center for Biotechnology Information [NCBI] Build 35) by use of the BLASTN program (http://www.ensembl.org) or BLAT (University of California, Santa Cruz; http://genome.ucsc.edu). Transcription units in the vicinity of the integration sites were identified using the RefSeq gene database (NCBI Reference Sequence Project; www.ncbi.nih.gov/RefSeq/). Similarities to repetitive sequences were ranked using the Smith-Waterman parameter generated by Repeat Masker (http://www.repeatmasker.org/).
Statistical analysis of integration site sequences. All statistical analyses were conducted using Stata statistical software (Stata Corp., College Station, TX; www.stata.com). To test for differences in proportions, we used r × c contingency table analysis (by Fisher's exact test when individual cell counts were small [<10] or by chi-square approximation). To test for equality of distribution, we used the two-sample Kolmogorov-Smirnov test.
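For readers who wish to see how an exact test of proportions operates on small counts, the following is a minimal sketch of a two-sided Fisher's exact test for the 2 × 2 special case only. It is not the Stata procedure used in this study, it does not handle general r × c tables, and the counts in the usage line are hypothetical values chosen purely for illustration.

```python
from math import comb


def fisher_exact_2x2(a, b, c, d):
    """Two-sided Fisher's exact P value for the table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of every table with the same
    margins whose probability does not exceed that of the observed table.
    """
    row1, col1, total = a + b, a + c, a + b + c + d

    def prob(x):
        # Probability that the top-left cell equals x given fixed margins.
        return comb(row1, x) * comb(total - row1, col1 - x) / comb(total, col1)

    observed = prob(a)
    low = max(0, col1 - (total - row1))
    high = min(row1, col1)
    # Small relative tolerance so that ties with the observed table are included.
    return sum(prob(x) for x in range(low, high + 1)
               if prob(x) <= observed * (1 + 1e-9))


if __name__ == "__main__":
    # Hypothetical counts, e.g. sites inside/outside a feature for two samples.
    print(round(fisher_exact_2x2(3, 1, 1, 3), 4))  # 34/70, i.e. 0.4857
```

Enumerating all tables with fixed margins is practical only when counts are small, which is consistent with reserving the exact test for sparse cells and using the chi-square approximation otherwise.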
Nucleotide sequence accession numbers. The GenBank accession numbers for integration sites sequenced in this study are EF035624 through EF035928 for HIV WT and EF035929 through EF036245 for NL-Mme. The integration site sequences shorter than 50 bp discussed in this paper have been deposited on our laboratory website (http://labs.pharmacology.ucla.edu/chowlab/Web/).
FIG. 2. Replication kinetics of WT and NL-Mme viruses. CEM cells were infected with equal amounts of the p24 equivalent of WT or NL-Mme virus at an MOI of 0.001. The culture media were monitored for p24 levels (ng/ml) at the indicated time points postinfection.
A total of 309 and 323 integration sites from cells infected with the WT and NL-Mme viruses, respectively, were analyzed and mapped to the human genome by use of BLAT. For both WT and NL-Mme viruses, integration events were found in all 23 human chromosomes (22 autosomes and the sex chromosome X) (Fig. 3). Similarly to previously published reports (26, 37), the frequencies of integration of WT HIV-1 were quite different among the different chromosomes in comparison to uniformly random integration (P = 7.07 × 10⁻¹⁰). Notably, chromosomes 12, 17, and 19 were significantly overrepresented (P values of 0.0410, 0.0008, and <0.0001, respectively), while chromosomes 8 and X were significantly underrepresented (P values of 0.0126 and 0.0020, respectively). The frequencies of integration events of NL-Mme in different chromosomes were not different from those of the WT (P = 0.282) (Fig. 3).
FIG. 3. Distribution of WT and NL-Mme virus integration events in human chromosomes. Results are expressed as the percentages of integration events in each chromosome. Human chromosome numbers are indicated at the bottom of the figure. The numbers of integration events for the random control (open bars), WT virus (gray bars), and NL-Mme virus (black bars) were 5,000, 309, and 323, respectively.
FIG. 4. Analysis by the conventional assay of chromosomal features associated with WT and NL-Mme virus integration events. CEM cells were infected with WT or NL-Mme virus at an MOI of 10. Integration sites were mapped using the conventional assay, and chromosomal features associated with WT (gray bars) and NL-Mme (black bars) proviruses were analyzed. The results are expressed as percentages of total integration events and compared with those of the random control (open bars). Chromosomal features analyzed include transcription units (TU), Alu and mammalian interspersed repeat (MIR) of the SINE, L1 and L2 of the LINE, LTR-E, and DNA-E.
Under our Int-tag assay conditions, the number of concatemerized Int-tags per clone ranged from 3 to 10, with an average of 5 Int-tags/clone. We sequenced a total of 515 Int-tags: 385 were authentic integration site sequences, 27 did not yield a high-quality match to the human genome, and 103 contained viral sequences downstream of the left LTR, which presumably derived from LTR circles. Of the 385 that yielded authentic integration site sequences, 194 (50.4%) contained 19 bp and 191 (49.6%) contained 20 bp of host DNA sequence.
To test the ability of mapping chromosomal locations by use of short DNA sequences, we carried out a simulation experiment with 10,000 randomly selected 19- and 20-bp cellular sequences. The positions of the 22 autosomal chromosomes plus the X sex chromosome were represented by adding their lengths together linearly, and uniformly random positions within the human genome were selected by choosing a random number between 0 and 3,070,128,058 (genome size). We found that 68.4% of these sequences were mapped to unique locations in the human genome (Rdm-tag) (Fig. 5). To further determine the ability of the short sequences to identify chromosomal locations, we generated conventional tags (Cvt-tags) based on the 323 NL-Mme integration sites obtained earlier using the conventional assay. The integration site sequences were divided as randomly and proportionally as the Int-tags (50.4%:49.6%), and the first 19 or 20 nucleotides immediately adjacent to the viral DNA of each integration site sequence were used to map the chromosomal location by use of the BLASTN program. Similarly to the simulation with the computer-generated Rdm-tags, 69.3% of the Cvt-tags were mapped to unique locations (Fig. 5).
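The sampling step of the simulation described above can be sketched as follows. Chromosome lengths are laid end to end, a uniformly random position is drawn on the concatenated coordinate, and the position is converted back to a chromosome and an offset. The chromosome names and lengths in the example are toy values, not the hg17 lengths used in the study, and the equal split between 19- and 20-bp tags is an assumption made for illustration.

```python
import bisect
import random
from itertools import accumulate


def locate(position, names, lengths):
    """Map a 0-based position on the concatenated genome to (chromosome, offset)."""
    ends = list(accumulate(lengths))  # cumulative end coordinate of each chromosome
    if not 0 <= position < ends[-1]:
        raise ValueError("position lies outside the concatenated genome")
    index = bisect.bisect_right(ends, position)
    start = ends[index - 1] if index else 0
    return names[index], position - start


def sample_tags(n, names, lengths, rng):
    """Draw n uniformly random tag start sites with a tag length of 19 or 20 bp."""
    total = sum(lengths)
    samples = []
    for _ in range(n):
        chrom, offset = locate(rng.randrange(total), names, lengths)
        samples.append((chrom, offset, rng.choice((19, 20))))
    return samples


if __name__ == "__main__":
    # Toy chromosomes (hypothetical lengths) standing in for the 23 used in the study.
    toy_names = ["chrA", "chrB", "chrC"]
    toy_lengths = [100, 50, 25]
    for tag in sample_tags(5, toy_names, toy_lengths, random.Random(0)):
        print(tag)
```

Because positions are drawn on the concatenated coordinate, each chromosome is sampled in proportion to its length, which is what a uniformly random control requires. A full implementation would also need to discard tags that run past the end of a chromosome or fall in unsequenced gaps.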
FIG. 5. Mapping of short integration sequences to unique locations and those in identifiable repeat elements. Integration site sequences from cells infected with NL-Mme were determined using the conventional assay (Cvt) or the high-throughput assay (Int-tag). Each sequence was mapped to either unique locations (black bars) or identifiable repeat elements (gray bars), and the results are expressed as percentages of total integration site sequences. The results were compared with those obtained from conventional tags (Cvt-tag), generated by taking the first 19 or 20 nucleotides immediately adjacent to the viral DNA of each integration site sequence determined by the conventional assay, or a random library of 19- and 20-bp cellular sequences (Rdm-tag) generated in silico.
For sequences that mapped to two or more chromosomal locations, a majority belonged to one of the several repeat sequence families (32). In all three independently derived tag sequences, although we could not determine the chromosomal location of ~30% of the tag sequences, repeat families (e.g., LINE and Alu) associated with 57% of these "multiple hit" sequences could be identified (Fig. 5). Overall, ~70% of 19- and 20-bp tag sequences can be mapped to a unique location in the human genome, while the chromosomal features associated with ~85% of these tag sequences can be identified (Fig. 5).
Validity of the high-throughput Int-tag assay. To determine the validity of the Int-tag assay, we compared the chromosomal features associated with the integration sites as determined by the Int-tag assay with those as determined by the conventional assay (Fig. 6). For additional comparison and to account for potential effects from using short sequences, we also analyzed Rdm-tags and Cvt-tags, which are derived from random cellular sequences and integration site sequences by use of the conventional assay, respectively. Sequences that mapped to a unique location in the human genome (~70% of total [Fig. 5]) were used for analyzing transcription units (Fig. 6A), and tag sequences that mapped to identifiable chromosomal features (~85% of total [Fig. 5]) were used for analyzing repeat elements (Fig. 6B). In the random control, the distribution of computer-generated integration sites in transcription units (31.4%) and various repeat elements paralleled the relative levels of abundance of the elements in the human genome (25). The distribution of the Rdm-tag sequences was similar to that of the uniformly random integration sites (P = 0.125), indicating that the use of short tag sequences did not significantly alter the analysis (Fig. 6). In contrast to the random control, the distribution of NL-Mme integration sites determined by the conventional assay was significantly favored in transcription units (60.1%, P < 0.0001) and Alu elements (16.4%, P = 0.0035) and disfavored in LTR-E (5.3%, P = 0.0193) and the L1 member of the LINE class (11.8%, P = 0.0093) (Fig. 6). Such a pattern of integration distribution was similar to that published previously using similar methods of analysis (37, 42). The distribution of Cvt-tag sequences showed a bias similar to that of their full-length counterparts (P value of 0.527 for the comparison of integration patterns for Cvt and Cvt-tags), further confirming that analyzing chromosomal features associated with integration sites was not affected by using short tag sequences. For integration sites determined by the Int-tag assay, we also found a significant preference for transcription units (58.4%, P < 0.0001) and Alu elements (12.7%), while LTR-E (5.4%, P = 0.0333) and the L1 repetitive elements (13.6%) were disfavored (Fig. 6). Although the Int-tag assay also detected similar preferences for Alu elements (12.7%) and for L1 repetitive elements (13.6%), the results were not statistically different (P values are 0.3605 and 0.0808, respectively), probably because of the relatively small sample size. Overall, we did not detect any significant differences in the patterns of integration distribution of NL-Mme between the conventional and Int-tag assays (P = 0.892).
FIG. 6. Analysis of chromosomal features associated with integration events by use of the high-throughput Int-tag assay and the conventional assay. Integration site sequences were determined using the conventional assay (striped bars) or the Int-tag assay (black bars) or generated in silico (open bars). Rdm-tag (gray bars) and Cvt-tag (hatched bars) sequences were derived from integration site sequences generated in silico and by the conventional assay, respectively, as described above. The locations of the integration sites were mapped, and chromosomal features in the vicinity of the integration sites were identified. (A) Transcription units. Integration site sequences that mapped to a unique chromosomal location were analyzed and scored as a part of a transcription unit only if the transcription unit was a member of the RefSeq genes. (B) Identifiable repeat sequences. Integration site sequences that mapped to multiple locations were analyzed for identifiable repeat elements, including Alu and mammalian interspersed repeat (MIR) of the SINE, L1 and L2 of the LINE, LTR-E, and DNA-E.
FIG. 7. Base preference in genomic sequence immediately adjacent to integration sites. The numbers on the x axis represent nucleotide positions of human DNA adjacent to the proviral DNA, where the point of joining between the HIV and human DNA lies to the left of position 0. The height of the bar represents the percent frequency of each base. A, T, G, and C are denoted by dark gray, light gray, black, and open bars, respectively. The preferred sequence is listed on the top of each panel. H denotes A, C, or T but not G; M denotes A or C; N denotes A, C, G, or T; R denotes A or G; W denotes A or T; and Y denotes C or T.
A full library of 19-bp sequence tags has a complexity of 2.75 × 10¹¹, which should be sufficient to map any 19-bp tag to a unique address in the human genome of 3.07 × 10⁹ bp. However, the human genome is AT rich (~60%), and at least 50% of the genome consists of repeat sequences (25). Our simulation exercise using a mixture of 10,000 computer-generated 19- and 20-bp sequence tags showed that, even with the uneven distribution of bases and high content of repetitive elements, about 70% of the short tags can be mapped to unique locations in the human genome. By converting the integration site sequences in HIV-infected cells obtained by the conventional assay into 19- and 20-bp tags, we confirmed that 70% of these tag sequences were mapped to the same unique locations. Similarly, 70% of tag sequences derived from the Int-tag assay had a unique chromosomal address.
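The complexity figure quoted above follows from counting all possible sequences of a given length over four bases. The sketch below reproduces that arithmetic and, under the idealized assumption of a genome with independent and equally frequent bases, estimates how often a fixed tag would occur by chance. The genome size is the value used in the simulation; the function names and the two-strand default are illustrative choices.

```python
GENOME_BP = 3_070_128_058  # genome size used for the random-position simulation


def tag_complexity(k):
    """Number of distinct DNA sequences of length k (four bases per position)."""
    return 4 ** k


def expected_chance_hits(k, genome_bp=GENOME_BP, strands=2):
    """Expected occurrences of one fixed k-mer in an idealized random genome.

    Assumes independent, equally frequent bases, which the real genome
    violates (AT richness, repeats); treat the result as a lower bound.
    """
    windows = genome_bp - k + 1
    return strands * windows / tag_complexity(k)


if __name__ == "__main__":
    for k in (19, 20):
        print(k, f"{tag_complexity(k):.3g}", f"{expected_chance_hits(k):.4f}")
```

For k = 19 the complexity is 274,877,906,944, or about 2.75 × 10¹¹. The idealized chance-hit rate is far below 1 per tag, so the roughly 30% of tags that map to multiple locations reflects base composition bias and repetitive elements rather than insufficient tag length.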
As expected, a majority of the tag sequences that mapped to multiple locations is associated with repetitive elements. This is consistent with an earlier genome-wide study of 524 HIV-1 integration sites showing that about 30% of proviruses are located near repeat sequences (37). Repetitive elements are grouped into five major classes: (i) transposon-derived (interspersed) repeats, (ii) processed pseudogenes, (iii) simple sequence repeats, (iv) segmental duplications, and (v) pericentromeric and subtelomeric tandem repeats. Over 90% of human repeat sequences are related to or derived from transposable elements, such as LINEs, SINEs, LTR-E, and DNA repeat elements (DNA-E) (32). Although about 30% of the tag sequences, generated either in silico or from the conventional and Int-tag assays, did not provide a unique address, it is important to note that we could classify about half of these tag sequences among the transposon-derived repeats and simple sequence repeats. Therefore, our analyses showed that about 70% of 19- and 20-bp tag sequences can be mapped to unique locations, while 85% can be identified by their chromosomal features.
Genome-wide analysis of integration sites indicates that HIV-1 favors transcription units and Alu elements, which are abundant in gene-rich chromosomal domains, and disfavors LTR-E, which are depleted in gene-rich regions of the genome (30, 37, 42). Similar integration site preferences were observed with NL-Mme, indicating that introduction of the MmeI restriction site in the U5 region of the HIV-1 LTR had no measurable effect on the resulting virus. An identical integration preference was also observed when the genome-wide distribution of integration sites of NL-Mme in human cells was analyzed using the Int-tag assay. Furthermore, the preferred integration site sequence as determined by the Int-tags resembles closely those reported previously (8, 18, 40). Therefore, the Int-tag assay is a valid approach in determining integration site sequences, mapping integration site locations, and identifying chromosomal features associated with the integration event.
Integration of retroviral DNA occurs at many sites within the host cell genome, but the process is not uniformly distributed. The site of integration has significant implications for both the virus and the host cell. Therefore, it is important to gain a better understanding of the distribution and preference of integration sites and factors that affect the site selection process. Although reliable assays have already been established for sequencing and mapping integration sites, studies on integration site choice and its determining factors involve the collection and analysis of numerous libraries, with each consisting of hundreds or thousands of independent integration sites. The availability of the described high-throughput assay will make the process less labor-intensive, less time-consuming, and more cost-effective. In addition to HIV-1, the described methodology can be adapted easily to integration site studies involving other retroviruses and transposons (7, 34).
This work was supported by National Institutes of Health grant CA68859 and a seed grant from the UCLA AIDS Institute (NIH grant AI28697) to S.A.C.
Published ahead of print on 13 September 2006.