**DOI:** 10.1128/JVI.02216-06

## ABSTRACT

The fate of most human endogenous retroviruses (HERVs) has been to undergo recombinational deletion. This process involves homologous recombination between the flanking long terminal repeats (LTRs) of a full-length element, leaving a relic structure in the genome termed a solo LTR. We examined loci in one family, HERV-K(HML2), and found that the deletion rate decreased markedly with age: the rate among recently integrated loci was almost 200-fold higher than that among loci whose insertion predated the divergence of humans and chimpanzees (8 × 10^{−5} and 4 × 10^{−7} recombinational deletion events per locus per generation, respectively). One hypothesis for this finding is that increasing mutational divergence between the flanking LTRs reduces the probability of homologous recombination and thus the rate of solo LTR formation. Consistent with this idea, we were able to replicate the observed rates by a simulation in which the probability of recombinational deletion was reduced 10-fold by a single mutation and 100-fold by any additional mutations. We also discuss the evidence for other factors that may influence the relationship between locus age and the rate of deletion, for example, host recombination rates and selection, and highlight the consequences of recombinational deletion for dating recent HERV integrations.

Endogenous retroviruses (ERVs) are the proviral form of exogenous viruses that have integrated into germ line cells and are passed vertically from parent to offspring (4). Each provirus is composed of several genes bounded by two noncoding regions termed long terminal repeats (LTRs), which are 500 to 1,000 bp in length and are identical upon insertion (integration). Approximately 98,000 human ERVs (HERVs), or fragments of HERVs, in the published human genome sequence have been located (reference 27; also see http://herv.img.cas.cz), and estimates of the percentage of the human genome that they represent range from 3 to 8% (19, 27). Many of these HERVs have existed for tens of millions of years, during which time approximately 85% of them have undergone recombinational deletion involving the two LTRs. This process results in the replacement of the full-length provirus by a single LTR sequence termed a solo LTR (20, 27, 34) and is a key determinant in the long-term outcome of HERV infection (17).

HERVs are divided into a relatively small number of lineages (or families), each of which is considered to represent the proliferation of an initial independent infection of the ancestor of the human genome up to 70 million years ago (mya) (36). With one possible exception, HERV-K(HML2), all these families have ceased proliferating and have effectively become extinct. The HERV-K(HML2) family, whose method of proliferation appears to be representative of that of other HERVs (2), is therefore the most suitable one for investigating the process of recombinational deletion. The family has homologues in Old World but not New World monkeys (24, 25, 29). The dates of the divergence of humans from Old World monkeys and of humans from New World monkeys are estimated to be approximately 21 to 25 mya and 32 to 36 mya, respectively (11). The family, therefore, must have initially invaded the ancestor of the human genome between these two periods, with 30 mya being a generally accepted estimate. The family appears to have been integrating continuously up to the present day (1, 3, 37) at an approximately steady rate (1).

The rate of recombinational deletion among HERVs may be expected to be related to the increasing mutational divergence between the LTRs as a provirus ages (22). With increasing age, the sequence similarity between the two LTRs of the provirus will decrease due to mutations acquired during host DNA replication. Recombinational deletion is likely to occur via intrachromosomal fold-back loops within the germ line, which are known to be dependent on the similarity between the two recombining sequences (33). Thus, increasing mutational divergence between the LTRs should reduce the probability of homologous pairing and hence of recombinational deletion.

We tested this prediction by determining the rates of recombinational deletion among loci of three different age classes in the family HERV-K(HML2), and we then attempted to recreate these rates in a simulation in which the probability of deletion was dependent upon the mutational divergence between the LTRs.

## MATERIALS AND METHODS

Calculating the proportion of full-length proviruses. Loci in the oldest age category were detected and analyzed by mining the published human and chimpanzee genome sequences (3). We identified all HERV-K(HML2) loci in the human genome sequence whose homologues in the chimpanzee genome sequence are full-length proviruses (the loci we selected do not represent an exhaustive list because the chimpanzee genome project is incomplete). For the two younger age categories, we determined the mean proportion of full-length proviruses among 19 human DNA samples for 63 loci that are represented by solo LTRs in the published human genome sequence. We previously screened these 19 samples for insertional polymorphisms (unfixed loci) (1). In the present study, we rescreened all samples that had an insertion, using a primer (5′ ATTTTACTTTTAGTTAGCCCC 3′) designed against a conserved region within the leader sequence, in order to determine whether the insertion was a full-length provirus or a solo LTR. Flanking primer sequences are available on request. PCR conditions were 40 cycles of 94°C for 1 min, 45 to 55°C for 1 min, and 72°C for 100 s and a final extension step for 10 min (25 pmol of each primer was used with 250 to 500 ng of template DNA). We then combined our findings with previously published data for an additional 11 loci that are represented by full-length proviruses in the published human genome sequence (16). This gave us the proportions of full-length proviruses for a total of 74 loci.

Simulation. To simulate the mutational divergence between the LTRs, we used a background rate of mutation within humans of 2 × 10^{−8} per bp per generation (7). The integration rate was assumed to be constant (1), and mutations in the two LTRs (each 968 bp, the mean length for the family) were randomly acquired in accordance with a Poisson distribution. The probability of deletion per generation was set as 10^{−4}, 10^{−5}, and 10^{−6} for zero, one, and two or more mutations in the LTRs, respectively. For the category corresponding to ages of 6 to 0.8 million years, we excluded those loci that represented the youngest age category. For the oldest category, we simulated the integration and aging of loci over a period of 24 million years prior to the chimpanzee-human divergence, and then for those loci that had survived as full-length proviruses to 6 mya, we simulated the recombinational deletion that occurred over the subsequent 6 million years. The Perl script to implement this simulation is available upon request.

Our simulation ignored the positions of mutations in the LTRs. If the critical determinant of homologous recombination is the length of a region of identical nucleotides, mutations occurring close together in an LTR (or near the corresponding position in the other LTR) may reduce the probability of homologous recombination less than mutations that occur farther apart. We investigated this idea within our simulation by ignoring mutations occurring less than 500 bp from an existing mutation. This practice led to a slightly increased rate of deletion in older loci but did not affect the overall pattern, giving proportions of full-length proviruses in the three age categories very similar to those produced without this modification (0.30, 0.04, and 0.69 compared to 0.33, 0.05, and 0.74, respectively).
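The simulation logic described above can be sketched in a few lines of Python. This is not the authors' Perl script: it is a minimal illustration that assumes a continuous-time approximation (exponential waiting times in place of per-generation draws), ignores mutation positions, and draws locus ages uniformly from an age window. The function names `survives` and `proportion_full_length` are hypothetical, and the proportions it produces need not match those reported for the original simulation.

```python
import random

MU = 2e-8                   # mutations per bp per generation (from the text)
LTR_BP = 968                # mean LTR length in bp (from the text)
M = 2 * LTR_BP * MU         # mutation rate per generation summed over both LTRs
DEL = (1e-4, 1e-5, 1e-6)    # deletion probability with 0, 1, and >=2 LTR mutations


def survives(age, rng):
    """Return True if a provirus that integrated `age` generations ago is still full-length.

    Assumption: events are drawn in continuous time, which approximates the
    per-generation process when all probabilities are small.
    """
    t, muts = 0.0, 0
    while True:
        d = DEL[min(muts, 2)]
        # Once two mutations are present, further mutations no longer change the rate.
        m = M if muts < 2 else 0.0
        t += rng.expovariate(m + d)      # waiting time to the next event
        if t >= age:
            return True                  # reached the present without deletion
        if rng.random() < d / (m + d):
            return False                 # the event was a recombinational deletion
        muts += 1                        # otherwise the event was a new LTR mutation


def proportion_full_length(min_age, max_age, n=20000, seed=1):
    """Simulate n loci with ages uniform in [min_age, max_age] (constant integration rate)."""
    rng = random.Random(seed)
    kept = sum(survives(rng.uniform(min_age, max_age), rng) for _ in range(n))
    return kept / n


# Youngest age category: integrations within the last 40,000 generations.
p_young = proportion_full_length(0, 40_000)
```

Because deletion is rare once two mutations have accumulated, loci that escape deletion early tend to persist, which is the qualitative behavior the simulation is meant to capture.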

Recombinational deletion and host recombination rate. The effect of variation in local host recombination rates was investigated by assigning every HERV locus a recombination rate. For this assignment, we used a high-resolution recombination map (18) based on 5,136 microsatellite markers and calculated average recombination rates across 3-Mb windows centered on the markers. Each HERV locus was assigned a recombination rate based on the nearest marker (only loci within 1.5 Mb of a marker were included in the analysis), and the means of the recombination rates for the loci were compared using a *t* test.
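The nearest-marker assignment can be illustrated with a short Python sketch. The function name `assign_rate`, the data layout (sorted marker positions on one chromosome with a parallel list of windowed rates), and the example values are assumptions made for illustration; they are not taken from the recombination map itself.

```python
from bisect import bisect_left

MAX_DIST = 1_500_000  # bp; loci farther than this from every marker are excluded


def assign_rate(locus_pos, marker_pos, marker_rate):
    """Return the windowed recombination rate of the nearest marker, or None if excluded.

    marker_pos  -- sorted marker positions (bp) on the locus's chromosome
    marker_rate -- rate for the 3-Mb window centered on each marker
    """
    i = bisect_left(marker_pos, locus_pos)
    # Only the markers immediately left and right of the locus can be nearest.
    candidates = [j for j in (i - 1, i) if 0 <= j < len(marker_pos)]
    if not candidates:
        return None
    j = min(candidates, key=lambda k: abs(marker_pos[k] - locus_pos))
    if abs(marker_pos[j] - locus_pos) > MAX_DIST:
        return None
    return marker_rate[j]


# Hypothetical markers and rates, for illustration only.
markers = [1_000_000, 5_000_000, 9_000_000]
rates = [0.5, 1.2, 2.0]
example = assign_rate(4_200_000, markers, rates)
```

The per-locus rates obtained this way for the two groups of loci would then be passed to whichever two-sample *t* test implementation is preferred.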

## RESULTS

Division of HERV-K(HML2) loci into three age categories. We analyzed HERV-K(HML2) loci belonging to three age categories. (i) The first category comprised loci present as full-length proviruses in the published chimpanzee genome sequence that correspond to either a solo LTR or a full-length provirus at the orthologous position in the human sequence. These elements therefore integrated between 30 mya and the human-chimpanzee divergence, which is dated at 6 mya (11, 12). The other two categories were loci, either full-length proviruses or solo LTRs, whose insertion postdated the human-chimpanzee divergence (chimpanzees have the preintegration site) and which were either fixed (ii) or unfixed (iii) in samples from diverse human individuals. We estimated the cutoff point between the latter two age categories to be 0.8 mya, which is the average time of fixation for a neutral allele (14) given a long-term effective human population size of 10,000 and a generation time of 20 years (6, 13, 39).
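The 0.8-mya cutoff follows from the standard neutral-theory result that a neutral allele destined for fixation takes, on average, about four times the effective population size in generations to fix. Written out with the values given above (the symbols $\bar{t}_{\text{fix}}$ and $N_e$ are our notation for this illustration):

$$\bar{t}_{\text{fix}} \approx 4N_e = 4 \times 10{,}000 = 40{,}000\ \text{generations}, \qquad 40{,}000 \times 20\ \text{years} = 0.8\ \text{million years}$$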

Calculating rates of recombinational deletion. For a full-length provirus that inserted *t* generations earlier, the probability *P* that it is still full-length today (that is, it has not undergone recombinational deletion) is expressed as follows:
$$P = (1 - r)^{t} \tag{1}$$

where *r* is the probability of deletion in any one generation, assumed to be constant. For the oldest age class, there were 25 full-length proviruses (Table 1) in the common ancestor of chimpanzees and humans, estimated to have existed 6 mya, giving *t* of 3 × 10^{5} generations. Twenty-two of them are still present as full-length proviruses in a single randomly chosen human genome, giving *P* of 0.88. Using equation 1, we therefore calculate *r* to be 4.3 × 10^{−7}. The 2-unit support limits on *P* (equivalent to 95% confidence limits [8]) are 0.71 and 0.97, giving bounds on *r* of 1.0 × 10^{−7} and 1.1 × 10^{−6}. These calculations ignore the possibility of independent recombinational deletion in both human and chimpanzee lineages, but assuming equal rates for all insertions, this probability is small. Similar results are obtained if we include full-length elements from the human lineage and their full-length or solo LTR orthologues in the chimpanzee, expanding the sample size by examining the proportion of shared, homologous loci that have been deleted in the chimpanzee lineage while remaining full-length in the human. We found that 2 of 24 such loci have undergone recombinational deletion in the chimpanzee lineage. Recent evidence suggests that a common generation time of 15 years can be used for much of the human and chimpanzee lineages (9), and applying this figure to the combined data set, we get a probability *P* of 0.90, with 2-unit support limits of 0.79 and 0.96. From these probabilities, we can derive a value of *r* of 2.6 × 10^{−7} (*t* = 4 × 10^{5} generations) with bounds of 1.0 × 10^{−7} and 5.9 × 10^{−7}.
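Equation 1 can be inverted directly, giving *r* = 1 − *P*^{1/*t*}. The short Python sketch below applies this to the values quoted above as a check on the arithmetic; the function name `deletion_rate` is ours, and `expm1` is used only to keep precision when *r* is very small.

```python
import math


def deletion_rate(p_full_length, generations):
    """Solve P = (1 - r)**t for r, i.e. r = 1 - P**(1/t) (equation 1)."""
    # -expm1(x) equals 1 - exp(x) but stays accurate when x is close to zero.
    return -math.expm1(math.log(p_full_length) / generations)


r_oldest = deletion_rate(22 / 25, 3e5)    # point estimate, oldest age class
r_upper = deletion_rate(0.71, 3e5)        # from the lower support limit on P
r_lower = deletion_rate(0.97, 3e5)        # from the upper support limit on P
r_combined = deletion_rate(0.90, 4e5)     # combined data set, 15-year generations
```

Note that the lower limit on *P* yields the upper bound on *r*, and vice versa, because a smaller surviving proportion implies a higher deletion rate.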

The intermediate age class includes elements that are present (either as full-length proviruses or as solo LTRs) in the published human genome sequence and in all other humans surveyed but are absent in chimpanzees (that is, chimpanzees have a preintegration site). We found 66 such elements (Table 2): for 7 of them, all humans surveyed still had the full-length provirus; for 56 of them, all humans surveyed had a solo LTR; and for 3 of them, we (or the authors of previous reports) found both full-length proviruses and solo LTRs, with frequencies of full-length proviruses of 0.46, 0.84, and 0.95 (average, 0.75). The (unweighted) average frequency of full-length proviruses across these loci thus corresponds to a *P* of 0.14. We assume that all these elements integrated in the period between the splitting of humans from chimpanzees and the average date for the coalescence of nuclear genes, that is, in the period between 6 and 0.8 mya, or 300,000 and 40,000 generations ago. If we assume a constant rate of insertion over this period and a constant rate of recombinational deletion between insertion and the present, then the expected proportion of full-length proviruses in modern humans is expressed as follows:
$$P = \frac{1}{260{,}000} \sum_{t=40{,}000}^{300{,}000} (1 - r)^{t} \tag{2}$$

Combining this equation with our observed value of *P* of 0.14, we get *r* of 1.5 × 10^{−5}. The 2-unit support limits on *P* are approximately 0.067 and 0.25 (calculated assuming a binomial distribution with *n* of 66), giving bounds on *r* of 9.5 × 10^{−6} and 2.3 × 10^{−5}.
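For readers who wish to evaluate equation 2 without summing 260,000 terms, the sum is a finite geometric series and has a closed form. The identity below is standard; the exponential approximation on the right is our addition and holds for small *r*:

$$\sum_{t=40{,}000}^{300{,}000} (1 - r)^{t} = \frac{(1 - r)^{40{,}000} - (1 - r)^{300{,}001}}{r}, \qquad P \approx \frac{e^{-40{,}000\,r} - e^{-300{,}000\,r}}{260{,}000\,r}$$

Substituting *r* = 1.5 × 10^{−5} into the approximation gives (0.549 − 0.011)/3.9 ≈ 0.14, in agreement with the observed value of *P*.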

Finally, the youngest age class includes elements that are present (either as full-length proviruses or as solo LTRs) in the published human genome sequence but were absent from at least one human in our survey (that is, some humans had a preintegration site). We found eight such elements (Table 2), and the average proportion of full-length proviruses among all insertions (ignoring alleles with a preintegration site) was 0.30. We assume that all these elements integrated in the time since the average coalescence of nuclear genes, that is, in the last 0.8 million years, or 40,000 generations. If we assume constant rates of insertion and recombinational deletion over this period, then the expected proportion of full-length proviruses in modern humans is expressed as follows:
$$P = \frac{1}{40{,}000} \sum_{t=1}^{40{,}000} (1 - r)^{t} \tag{3}$$

Combining this equation with our observed value of *P* of 0.30, we get *r* of 8.0 × 10^{−5}. It is not clear how to calculate support limits on this estimate, but if we assume a binomial distribution with a sample size of eight, we get conservative bounds on *P* of 0.044 and 0.72, giving bounds on *r* of 1.8 × 10^{−5} and 5.6 × 10^{−4}.
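Equations 2 and 3 cannot be inverted algebraically, so *r* has to be found numerically. The sketch below is one way to do this, not necessarily the procedure used for the published estimates: it evaluates the sum with the standard geometric-series identity and then bisects on *r*, relying on the fact that the expected proportion falls monotonically as *r* rises. The function names and the search interval are assumptions.

```python
import math


def expected_full_length(r, t_min, t_max, norm):
    """Sum of (1 - r)**t for t = t_min..t_max, divided by norm (closed-form geometric sum)."""
    q = 1.0 - r
    return (q ** t_min - q ** (t_max + 1)) / (r * norm)


def solve_r(p_obs, t_min, t_max, norm, lo=1e-9, hi=1e-2):
    """Find r such that the expected proportion equals p_obs, by bisection on a log scale."""
    for _ in range(200):
        mid = math.sqrt(lo * hi)  # geometric midpoint of the current bracket
        if expected_full_length(mid, t_min, t_max, norm) > p_obs:
            lo = mid              # too many survivors: the true r is larger
        else:
            hi = mid              # too few survivors: the true r is smaller
    return math.sqrt(lo * hi)


r_intermediate = solve_r(0.14, 40_000, 300_000, 260_000)  # equation 2
r_youngest = solve_r(0.30, 1, 40_000, 40_000)             # equation 3
```

The same function can be called with the support limits on *P* to obtain the corresponding bounds on *r*.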

The rate of recombinational deletion inferred from the observed mean proportions thus decreases markedly with the increasing age of the locus (Table 3). There is an almost 200-fold decrease between the rates of the youngest and the oldest age categories of 8.0 × 10^{−5} and 4.3 × 10^{−7} per locus per generation, respectively.

Reproduction of observed recombinational deletion rates by computer simulation. We found that the decreasing rate of recombinational deletion with increasing locus age could be reproduced approximately in a simple simulation in which the probability of recombinational deletion of 10^{−4} per generation was reduced 10-fold by the acquisition of one mutation in the LTRs and reduced 100-fold by the acquisition of two or more mutations (Table 3 and Fig. 1). These parameter values were based on experimental data on the rates of homologous recombination within mammals. For example, it appears that 150 to 500 bp of uninterrupted sequence identity is required for full efficiency, with the rate of homologous recombination declining from threefold to more than 100-fold with a variety of 1- or 2-bp mismatches (21, 28, 38). The observed numbers of mutations in the LTRs of human-specific proviruses lend additional support for the accuracy of our simulation: we observed a mean of nine mutations in these LTRs (*n* = 17), and our simulation predicted a mean of six. We therefore suggest that mutational divergence determines the rate of recombinational deletion among HERVs. The marked decline in the deletion rate explains why most unfixed (insertionally polymorphic) HERV loci are represented only by a solo LTR (even though they are probably only a few hundred thousand years old) yet some proviruses can persist in their full-length state for tens of millions of years. In our simulation, 50% of integrations became solo LTRs within 150,000 years.

## DISCUSSION

There is an earlier and higher estimate (2 × 10^{−3} deletion events per locus per generation) of the rate of recombinational deletion among members of the HERV-K(HML2) family (16). This rate was calculated by counting the number of deletion events that have taken place in 13 human-specific loci and then using standard population genetics theory to extrapolate the number lost by genetic drift. Our figures, which are based on a larger sample size and a different method, are more in agreement with observed recombinational deletion rates among relatively recent integrations in the mouse. For example, the rate is 4.5 × 10^{−6} deletion events in the single ecotropic *Emv-3* locus per generation (31) and averages around 4 × 10^{−6} deletion events per generation (1 in 250,000 meiotic generations) among 103 nonecotropic murine leukemia virus loci (10); both rates were calculated from direct observations of deletions among progeny. Our lower confidence limit for the two younger HERV-K(HML2) categories (ca. 10^{−5}) is close to the upper confidence limits for the mouse loci, which are 8 × 10^{−6} for *Emv-3* (31) and 10^{−5} for the other murine leukemia virus loci (our calculation from data presented in reference 10).

The earlier study on HERV-K(HML2) (16) used sequence data to show that multiple deletion events have occurred at a single solo LTR locus (11q22, or 154c11). Furthermore, the three deletion events observed were estimated to have occurred up to 1.5 million years after the integration of the original full-length element (the LTRs had diverged by several mutations prior to each deletion event). We suggest that this particular locus is either an exception to the general trend that we have proposed (our oldest age category includes a few loci that underwent recombinational deletion at an even greater age) or, as suggested by the authors of the other study, that we may be observing the effect of recombination and/or gene conversion after a deletion event.

We have shown that the rate of recombinational deletion declines with locus age in a fashion that can be explained by the increasing mutational divergence between the LTRs. However, there are two other factors that may complicate this relationship.

First, the background local rate of recombination varies substantially across the human genome (18), and therefore, old, full-length proviruses may have persisted simply because they are situated in genomic regions experiencing low levels of recombination. A pooled analysis of all HERV families (17a) has shown that local variation in the host recombination rate does have an effect on the rate of recombinational deletion, but this effect is small: we estimate that it is sufficient to produce only an approximately threefold difference in the proportion of full-length HERV-K(HML2) proviruses, not the 200-fold difference observed. The effect of the background recombination rate would require a larger sample size and a longer time period to manifest itself, and we thus observe no tendency for the old HERV-K(HML2) proviruses to be in regions of lower host recombination rates than human-specific insertions represented today by solo LTRs (*t* test; *P* = 0.44).

Second, there may be greater selection against full-length proviruses than solo LTRs. There is considerable evidence that ERVs are generally harmful to their hosts (15, 23, 26, 32), and solo LTRs are likely to be less harmful than full-length replication-competent proviruses. Although solo LTRs are unlikely to be neutral in all cases (because the regulatory sequences capable of disrupting host gene expression are present in the LTR), they cannot themselves give rise to further insertions. Additionally, the somatic effects of some ERVs are known to be caused only by the full-length provirus: for example, the mutations *d* (dilute coat color) and *hr* (hairless) caused by ERV integration into the mouse are reversed following recombinational deletion (31, 35). Also, although a medical effect of HERV-K(HML2) is unproven, the injection of the accessory gene *rec* (*cORF*), which is found only in full-length HERV-K(HML2) proviruses, induces tumor formation in immunocompromised nude mice (5). The splice sites in the internal region may also interfere with host transcription. Therefore, among recent insertions, more full-length proviruses than solo LTRs may be lost from the host population as a result of selection factors acting on the host, and thus, fewer full-length viruses would drift towards fixation. If this phenomenon has occurred, it would lead us to overestimate the recombinational deletion rate by reducing the observed proportion of full-length proviruses. However, at the present we have no data on selection that would allow us to correct for this possibility. A point that arises from our analysis is that recently inserted full-length proviruses are perhaps as likely to be inactivated (in the sense of being unable to replicate further) by recombinational deletion as by point mutation. 
Our inferred recombinational deletion rate for the youngest age category approaches 10^{−4} per generation, and the probability of acquiring a point mutation approximates that figure (given a background mutation rate in humans of 10^{−8}/bp and a typical provirus length of 10^{4} bp), with perhaps 40% of these mutations being lethal (30).
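The comparison in the preceding sentence amounts to the following arithmetic, using only the values stated there (the 40% lethal fraction is the rough figure quoted from reference 30):

$$10^{-8}\ \text{per bp} \times 10^{4}\ \text{bp} = 10^{-4}\ \text{mutations per provirus per generation}, \qquad 0.4 \times 10^{-4} = 4 \times 10^{-5}\ \text{lethal mutations per provirus per generation}$$

This is of the same order as the inferred deletion rate of 8.0 × 10^{−5} per locus per generation for the youngest age category.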

One consequence of the differences in recombinational deletion rates demonstrated above is that the estimated dates of integration of full-length HERVs may, in general, be too old when there are only a few mutational differences between their paired LTRs. This is because such dates are typically estimated from the pairwise differences between LTRs by assuming a neutral average rate of human evolution since their integration. However, as we have shown, proviruses that acquire (by chance) mutations in their LTRs at, or soon after, integration are less likely to undergo recombinational deletion and therefore persist in their full-length state, whereas most other elements rapidly decay into solo LTRs.
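To make the dating argument concrete, a common form of the LTR-divergence calculation (assumed here for illustration; it is not spelled out in the text above) divides the number of differences *k* between the two LTRs by the rate at which the pair accumulates mutations. With LTR length *L* and per-site mutation rate *μ*, and borrowing the values from Materials and Methods:

$$T \approx \frac{k}{2L\mu} = \frac{k}{2 \times 968 \times 2 \times 10^{-8}} \approx 25{,}800\,k\ \text{generations}$$

Under these assumptions, each LTR difference corresponds to roughly 0.5 million years at 20 years per generation. Because proviruses that happen to acquire early LTR mutations are preferentially retained as full-length elements, a surviving provirus with one or two differences may be considerably younger than this formula suggests.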

## ACKNOWLEDGMENTS

This work was funded by the Wellcome Trust. J.W. and A.K. were supported by Natural Environment Research Council studentships, and A.K. was also supported by a Medical Research Council fellowship.

We thank Vini Pereira and Anna Dawson for help with the genome mining and experimental work, respectively.

## FOOTNOTES

- Received 9 October 2006.
- Accepted 12 June 2007.

- Copyright © 2007 American Society for Microbiology