Journal of Virology, October 2005, p. 12507-12514, Vol. 79, No. 19
0022-538X/05/$08.00+0 doi:10.1128/JVI.79.19.12507-12514.2005
Copyright © 2005, American Society for Microbiology. All Rights Reserved.
Department of Biological Sciences, Imperial College, Silwood Park Campus, Ascot, Berks SL5 7PY, United Kingdom,1 Plymouth Marine Laboratory, Prospect Place, The Hoe, Plymouth PL1 3DH, United Kingdom2
Received 27 April 2005/ Accepted 1 July 2005
Within the published human genome sequence, there are over 98,000 human endogenous retroviruses (HERVs), but all are defective, containing nonsense mutations or major deletions. No replication-competent HERVs have been identified to date (26, 31, 33, 35), with only one (K113) with open reading frames for all genes (35), and thus their activity and infectivity is thought to have decreased substantially from levels occurring during earlier periods of primate evolution (1, 23, 34).
One possible exception to this trend is the HERV-K(HML2) family, which makes up less than 1% of HERV elements (27). This family has been active and infectious for much of the past 30 million years (2, 12, 20, 28, 35). It contains many members that inserted into the genome after the divergence of humans and chimpanzees approximately 6 million years ago, as well as several that are insertionally polymorphic (some human individuals have the insertion while other individuals have the empty, preinsertion site) (13, 21, 25, 35). Here we provide the first measures of the overall genomewide frequency of both human-specific and insertionally polymorphic elements in this HERV family. Full-length human-specific HERV-K(HML2) loci have been screened previously for insertional polymorphisms (21, 35), but this is not the case for the solo LTRs, which are much more abundant and can therefore provide substantially more data on the insertional history of an endogenous retrovirus family.
We also compare our observed level of insertional polymorphism to the value that we might expect if the HERV-K(HML2) family was still actively inserting at present. We generate this expectation by using a standard neutral population genetic model, whose two parameters are (i) an insertion rate that is calculated from the number of human-specific insertions in the published human genome sequence together with an estimate of the number of generations since the human-chimpanzee divergence and (ii) an estimate of the long-term population size in humans, as taken from the literature. The possibility that the family is active today is particularly important because it has been implicated in a range of human diseases.
TABLE 1. Provenances of genomic DNA samples
model of sequence evolution (parameter values estimated from the data). The phylogeny was rooted by using an element also present in the chimpanzee and gorilla genomes.
Calculation of the insertion rate (µ). The average rate of insertion since the divergence of humans and chimpanzees was calculated by dividing the number of human-specific insertions by the number of generations in the human lineage since divergence, assuming an average generation time of 20 years (11, 14, 15).
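As a rough illustration of this calculation, the following minimal sketch divides a count of human-specific insertions by the number of generations since divergence. The function name is hypothetical, and the count of 114 is an assumption back-calculated from the rates quoted elsewhere in the article (about 19 insertions per million years over roughly 6 million years); it is not stated in this section.

```python
def insertion_rate(n_insertions, divergence_years, generation_years=20):
    """Average number of insertions per haploid genome per generation."""
    generations = divergence_years / generation_years
    return n_insertions / generations


# ASSUMPTION: 114 human-specific insertions (19 per million years x 6 million years)
# and a divergence time of 6 million years; both are illustrative inputs.
mu = insertion_rate(114, 6_000_000)
print(f"mu = {mu:.1e} insertions per haploid genome per generation")
```

With these inputs the sketch gives a rate on the order of 10⁻⁴ per generation, the same order as the estimate used in the Results.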
The model and its parameters. The program ms (19) generates samples drawn at random from a population obeying the Wright-Fisher model of genetic drift and an infinite-sites model of mutation (18). The infinite-sites model was used as it allows for an unlimited number of unique sites (in this case, loci) into which elements can insert and does not allow reversals to the preinsertion state. Briefly, the program performs the following functions. (i) It generates random genealogies for a specified number of samples, which in our case represent haploid genomes (a total of 39 representing the 19 human DNA samples plus the human genome sequence). (ii) Branch lengths are calculated (in terms of numbers of generations) using coalescent theory. (iii) Mutations, which in our case represent insertions, are randomly distributed onto these branches (following a Poisson distribution). (iv) The distribution of insertions among each sample is output from the program as a binary list (at each locus, 0 denotes a preinsertion site and 1 denotes an insertion). We then randomly selected one of these samples to represent the human genome sequence and calculated the number of loci that were represented by an inserted element in this sample but were insertionally polymorphic in the other 38 samples. We ran 1,000 simulations, and for each we incorporated free recombination by summing the results from 10,000 coalescent trees, on each of which the insertion rate was 0.0001 µ. It should be noted that we are considering here only insertions that are neutral, since insertions harmful to the host are likely to be lost rapidly from the host population as a result of selection.
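The steps above can be sketched in a few lines of Python. This is not the ms program and not the authors' code; it is a minimal stand-in under stated assumptions: time is measured in units of 2Ne generations, θ = 4Neµ, genome 0 plays the role of the published genome sequence (valid because the sampled genomes are exchangeable), and free recombination is approximated with far fewer trees and replicates than the 10,000 and 1,000 used in the article. All function names are hypothetical.

```python
import math
import random


def poisson(lam, rng):
    """Poisson draw by Knuth's multiplication method (adequate for small lam)."""
    threshold = math.exp(-lam)
    k, p = 0, rng.random()
    while p > threshold:
        k += 1
        p *= rng.random()
    return k


def tree_insertions(n, theta, rng):
    """Insertions dropped on one random coalescent tree of n haploid genomes.

    Each lineage is (number of descendant genomes, carries reference genome 0).
    With time in units of 2*Ne generations, each lineage gains insertions at
    rate theta / 2. Every returned insertion is polymorphic in the sample.
    """
    lineages = [(1, True)] + [(1, False)] * (n - 1)
    hits = []
    while len(lineages) > 1:
        k = len(lineages)
        t = rng.expovariate(k * (k - 1) / 2)       # time to the next coalescence
        for _ in range(poisson(theta / 2 * k * t, rng)):
            hits.append(rng.choice(lineages))      # lineages are equally long here
        i, j = sorted(rng.sample(range(k), 2))     # pick two lineages to merge
        b = lineages.pop(j)                        # pop the larger index first
        a = lineages.pop(i)
        lineages.append((a[0] + b[0], a[1] or b[1]))
    return hits


def simulate_replicate(n, theta, n_trees, rng):
    """Return (polymorphic loci in the sample, those inserted in the reference)."""
    total = in_reference = 0
    for _ in range(n_trees):                       # independent trees, each with theta / n_trees
        for _, has_ref in tree_insertions(n, theta / n_trees, rng):
            total += 1
            in_reference += has_ref
    return total, in_reference


rng = random.Random(1)
theta = 4 * 10_000 * 3.8e-4                        # 4 * Ne * mu, values quoted in the Results
runs = [simulate_replicate(39, theta, 20, rng) for _ in range(200)]
mean_total = sum(r[0] for r in runs) / len(runs)
mean_ref = sum(r[1] for r in runs) / len(runs)
print(round(mean_total, 1), round(mean_ref, 1))
```

The sketch counts an insertion as present in the reference whenever genome 0 descends from the branch carrying it, including insertions found only in genome 0; whether the original analysis treated such singletons in the same way is an assumption.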
FIG. 1. Maximum likelihood phylogeny of HERV-K(HML2) LTRs. Filled and open circles indicate LTRs from full-length and segmentally duplicated elements, respectively. Black boxes represent taxa present in both the chimpanzee and human genome sequences, whereas red boxes represent human-specific elements. Intermingling of the two classes is probably due to a variety of factors, such as gene conversion or ancestral polymorphism. Open boxes represent taxa whose distribution could not be determined directly (the chimpanzee genome project is incomplete), and their probable distribution was estimated from their position in the phylogeny. A dashed line indicates the placement of K113, which is absent from the published human genome sequence. The large boxed region (which excludes most non-human-specific elements) is shown in more detail in Fig. 2. Scale bar shows mean number of substitutions per site.
Six of the 63 solo LTRs that were successfully amplified displayed insertional polymorphism (Fig. 2 and 3). Only one of these polymorphisms has been previously described (25). The status of 30 other solo LTRs could not be determined, as many are located in highly repetitive regions and gave multiple bands when amplified (unpublished results). Another two insertional polymorphisms are known from previous screening of the 15 full-length elements: 8c8, also known as K115 (35), and 154c11, also known as 11q22 (21). Thus, a total of 8 out of 78 tested elements (63 solo LTRs plus 15 full-length elements) in the published human genome are insertionally polymorphic (Fig. 3 and Table 2). Furthermore, assuming that the 30 untested loci are as likely to be polymorphic as those investigated successfully, another three loci (8/78 x 30) will be insertionally polymorphic; thus, a total of about 11 of the HERV-K(HML2) elements from the published sequence will display insertional polymorphism in our sample of human individuals. Most of the polymorphic loci are near the tip of the LTR phylogeny, indicating that the insertion events are likely to be relatively recent (Fig. 2). Also, the phylogeny shows many nodes near the tips, indicating recent insertional activity. The frequencies of the inserted elements range from 0.04 to 0.94, with a mean frequency of 0.61. Five of the six polymorphic loci that we examined displayed either preinsertion sites or solo LTRs, while only one, 859c12 (Fig. 3), displayed all three states including the full-length element. This suggests that in most cases solo LTR formation via recombinational deletion occurs rapidly and before the element reaches fixation. Indeed, only in the case of the element 8c8 (K115) is there no evidence of solo LTR formation, and this element has a low inserted allele frequency of only 0.04 within the human population (35).
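The extrapolation to the untested loci is simple proportional scaling, and the arithmetic can be checked with a short sketch. The variable names are illustrative; the counts are those given in the paragraph above.

```python
tested = 63 + 15        # solo LTRs plus full-length elements that could be scored
polymorphic = 8         # insertionally polymorphic loci among the tested elements
untested = 30           # loci whose status could not be determined

fraction = polymorphic / tested               # proportion polymorphic among tested loci
expected_extra = fraction * untested          # expected polymorphic loci among untested ones
estimated_total = polymorphic + round(expected_extra)

print(f"{fraction:.3f} {expected_extra:.1f} {estimated_total}")
```

The calculation assumes, as the text does, that untested loci are as likely to be polymorphic as tested ones; loci in highly repetitive regions could differ.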
FIG. 2. Elements screened for insertional polymorphism. Taxon names are followed by their genomic location in parentheses. Black boxes indicate elements homozygous for the insertion in all 19 individuals surveyed, whereas those in red display insertional polymorphism, with the filled region in each box being proportional to the frequency of the inserted element in the samples. The other (nonboxed) elements gave inconclusive results, usually because of their location in regions of highly repetitive DNA. Data from all full-length elements were taken from previous reports (21, 35). A nucleotide alignment of the surveyed solo LTRs, together with their flanking sequences, is shown in Fig. S1. Scale bar shows mean number of substitutions per site.
FIG. 3. Detection of HERV-K(HML2) preinsertion sites. Amplification of solo LTR loci within the published human genome sequence showed a preinsertion site (PRE), a solo LTR (sLTR), or a full-length element (FULL) when tested against a panel of 19 individuals. The provenance of each individual is shown in Table 1, and allele frequencies are shown in Table 2. The identity of each band was confirmed by DNA sequencing. We rescreened the previously identified polymorphism 165c5, as the originally estimated frequencies were based largely on individuals from Russia (25).
TABLE 2. Polymorphic HERV-K(HML2) insertions and their allele frequencies
We therefore calculated a frequency distribution (see Materials and Methods) for the expected number of loci in the published human genome sequence that would be insertionally polymorphic when compared to our sample of 19 individuals, assuming activity of the HERV-K(HML2) family until the present. Parameters used were our estimated insertion rate since the divergence from chimpanzees (µ = 3.8 × 10⁻⁴) and the previous estimate of long-term effective population size (Ne) of 10,000 (17, 36). Given that a polymorphism has a probability of 0.72 (78/108) of being detected in our survey (i.e., is in a region of the genome that can be amplified by PCR), the model predicts a mean of 10.6 polymorphic insertions, with 95% bounds of 5 and 18. Our observed value of eight polymorphic insertions is well within this range (P = 0.57). Moreover, the distribution of insertion frequencies is not significantly different from that predicted by the model (P is 0.25 [two-tailed] for the mean, variance, and skew). This result is robust to using other plausible parameter values. For example, our observed figure of eight polymorphic sites is not statistically different from the model's predictions if human generation time is 10 rather than 20 years or if the time since the human-chimpanzee divergence is 4.5 rather than 6 million years.
Note that the total number of polymorphic sites within the sample cannot be verified directly because we tested only those sites in the published human genome sequence that have an inserted element. Thus, after 1,000 replicates, the model predicts that there would be a mean of 63.4 polymorphic loci within our sample of 39 haploid genomes (of which 45.6 lie in amplifiable regions) but that only 14.5 (10.6 in amplifiable regions) of these would be present as inserted elements (as opposed to preinsertion sites) in any one haploid genome. Furthermore, although the model assumes neutrality, our inferences from it do not depend upon all insertions being neutral. Instead, we assume that elements with a negative effect on host fitness are lost from the host population and that we are thus observing the net rate of accumulation.
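The simulated means quoted above can be compared with closed-form expectations from the standard neutral infinite-sites model, in which the expected number of sites carried by exactly i of n sampled genomes is θ/i, with θ = 4Neµ. The sketch below uses that textbook result in place of the simulation, so it is a cross-check under assumptions rather than the authors' procedure; the function names are hypothetical.

```python
def expected_polymorphic(n, theta):
    """Expected polymorphic loci in n genomes: theta * (1 + 1/2 + ... + 1/(n-1))."""
    return theta * sum(1 / i for i in range(1, n))


def expected_in_reference(n, theta):
    """Expected polymorphic loci carried by one genome: sum of (theta/i)*(i/n)."""
    return theta * (n - 1) / n


theta = 4 * 10_000 * 3.8e-4      # Ne = 10,000 and mu = 3.8e-4, as given in the Results
n = 39                           # 38 sampled haploid genomes plus the genome sequence
detectable = 78 / 108            # probability that a locus lies in an amplifiable region

total = expected_polymorphic(n, theta)
in_ref = expected_in_reference(n, theta)
print(round(total, 1), round(total * detectable, 1))
print(round(in_ref, 1), round(in_ref * detectable, 1))
```

The expectations come out at roughly 64 polymorphic loci in the sample and roughly 15 inserted in any one genome, close to the simulated means of 63.4 and 14.5; small differences are expected from simulation noise and from how singleton insertions in the reference genome are counted.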
(1 − 2pq), where p is the frequency of the insertion and q is the frequency of the preinsertion, and assuming Hardy-Weinberg equilibrium]. This is consistent with an infinite-sites mathematical model, which also predicts that 6% of the population will be homozygous at all sites (18). Gene flow between human subpopulations is relatively high, and so the assumption of random mating used to derive these expectations will not unduly bias our results (10, 30). Recently, Bennett et al. (3) examined the equivalent of an additional haploid genome for insertional polymorphisms and identified two HERV-K(HML2) sites as polymorphic when compared to the published human genome sequence. This is not significantly different from our result, as the model predicts that approximately 18% of individuals will be heterozygous at fewer than three sites.
Is the HERV-K(HML2) family active in present-day humans? The high level of observed insertional polymorphism within the HERV-K(HML2) family indicates that a substantial number of insertions have occurred since the divergence of the human individuals investigated in this study. Furthermore, there are now several lines of evidence to suggest that the family may well still be active at present. First, there is the close match between the observed eight insertional polymorphisms and the number predicted using the population genetic model, which assumes continued activity until the present. However, we note that a recent cessation of activity would also be consistent with our data because there would still be insertionally polymorphic elements retained in the present-day human population. For example, in the model, stopping new insertions 650,000 years before the present predicts a mean of 2.9 polymorphic insertions, with 95% bounds of 0 and 7. Our observed value of eight polymorphic insertions falls outside these bounds, but it does not if the cessation was more recent (e.g., a cessation at 500,000 years ago gives a mean of 4.2, with 95% bounds of 1 and 9). Although we cannot exclude this possibility, we think it is unlikely. Also, the mean insertional rate has remained the same since the human-chimpanzee divergence: we found 440 HERV-K(HML2) elements (including solo LTRs) in the published human genome sequence that had inserted before the divergence of humans and chimpanzees (unpublished data). This gives a mean rate of 18 insertions per million years for the first 24 million years of the family's history, compared to 19 insertions per million years for the last 6 million years. The second line of evidence indicating continued activity is the phylogenetic pattern: most insertionally polymorphic elements, as well as many nodes, are near the tips of the phylogeny. Finally, the young age of some full-length elements, as determined by their LTRs having identical sequences, is compatible with continued activity.
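The comparison of long-term insertion rates in the preceding paragraph can be reproduced with a short sketch. The function names are hypothetical; the recent rate is obtained here by converting the per-generation estimate (µ = 3.8 × 10⁻⁴, with a 20-year generation time) into insertions per million years, which is an assumption about how the quoted figure of 19 arises.

```python
def per_million_years(n_insertions, span_million_years):
    """Mean number of insertions per million years over a time span."""
    return n_insertions / span_million_years


def per_generation_to_per_million_years(mu, generation_years=20):
    """Convert a per-generation insertion rate into a per-million-year rate."""
    return mu * 1_000_000 / generation_years


older_rate = per_million_years(440, 24)                    # before the human-chimpanzee split
recent_rate = per_generation_to_per_million_years(3.8e-4)  # since the split

print(f"older: {older_rate:.1f} per Myr, recent: {recent_rate:.1f} per Myr")
```

The two rates differ by less than one insertion per million years, which is the basis for the statement that the mean rate has not changed.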
Thus, we believe that the simplest explanation for our data is that the family is active at the present day. If we are correct, then a number of predictions can be made regarding HERV-K(HML2) polymorphism. The insertion rate for humans as a whole will be 2Nµ, which suggests there are now approximately 4.5 × 10⁶ new insertions occurring every generation (assuming a human population size [N] of 6 × 10⁹), and the total number of polymorphic elements will be substantially higher than this figure. We have also shown previously that most HERV-K(HML2) insertions are the result of reinfection rather than retrotransposition within germ line cells (2), and thus the family is likely to be infectious as well as insertionally active. This reinfection may require movement only between cells of the same individual and does not necessarily require infectious transfer between individuals.
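The population-wide figure follows directly from 2Nµ, since each of the N individuals carries two haploid genomes. A minimal check, using the values given in the text (the variable names are illustrative):

```python
mu = 3.8e-4          # insertions per haploid genome per generation
population = 6e9     # human population size N assumed in the text

# Each person carries two haploid genomes, hence the factor of 2.
new_insertions_per_generation = 2 * population * mu
print(f"{new_insertions_per_generation:.2e}")
```

This gives a value of about 4.5 million new insertions per generation, which holds only if the long-term average rate µ also applies to the present-day population.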
A model of HERV-K(HML2) evolution. The absence of known, infectious members of the HERV-K(HML2) family and the lack of elements with a full coding potential within the published human genome sequence appear, initially, to contradict our conclusion that the family is likely to be active at present. Furthermore, the modeling presented above is based on insertions being neutral and therefore excludes any elements that never reach high allele frequencies due to negative selection acting on the host. Such elements are also ignored by our population genetic model. To take these factors into account, we propose the following scenario (shown in Fig. 4) for the evolution of the HERV-K(HML2) family. We suggest that there is (and has been for many millions of years) a large population of unfixed HERV-K(HML2) elements within the human germ line and that a subset of these elements is both active and infectious at any one time point. Because many of the active and infectious elements may be deleterious to their hosts, they are likely to be present only transiently and to rarely (due to negative selection) reach high allele frequencies in the population as a whole. Some of these elements then acquire, by chance, knockout mutations (for example, via recombinational deletion or frameshift mutations); it is these elements, now neutral and defective, which are able to reach high allele frequencies, and a few eventually become fixed. Thus, it is not surprising that the published human genome sequence [which contains most of the HERV-K(HML2) sequences characterized to date] contains no intact members; it is best regarded as a depository of old, defective elements that have drifted to fixation. This is because there is only a very small chance that any one individual or genome harbors one of the active and infectious members of the current HERV-K(HML2) population. We also note that recently inserted elements are less likely to have undergone the recombinational deletion events we observed for most of the solo LTR loci described here.
FIG. 4. Proposed model of HERV-K(HML2) family evolution within humans. (a) At each time point, there is a large unfixed population of elements, a proportion of which are replication competent and infectious, whereas others are defective. Some of the subset of defective elements, but none of the replication-competent elements, eventually drift to fixation. The population of unfixed elements is continuously replenished by new insertions resulting from the replication of intact and unfixed elements. (b) Over time, the fixed and defective elements (i.e., A, B, and C) accumulate so that in any one genome all, or almost all, of the elements are defective, the intact and infectious elements being present only in a very small proportion of individuals.
We consider that the rarity of novel HERV-K(HML2) insertions may explain why no active, disease-causing elements are known. Another type of retrotransposable element, long interspersed nuclear elements (LINEs), is much more active in humans, with a long-term accumulation rate of 4 × 10⁻³ elements per haploid genome per generation (7). Experimental work indicates that the actual frequency of novel LINEs among human individuals may be between 1 in 2 and 1 in 33 (9). However, despite this high level of activity, it appears that new insertions by LINEs are responsible for only approximately 1 out of every 1,000 disease-causing mutations in humans (24). Thus, disease-causing mutations caused by members of the HERV-K(HML2) family may well be sufficiently rare to have escaped detection to date.
We thank Vini Pereira and Aris Katzourakis for help with the mining of the human genome, Peter Kabat and Jonathan Ng for help with the molecular analysis, and Dick Hudson for advice on using ms.
Supplemental material for this article may be found at http://jvi.asm.org/.
es, A. Burt, and M. Tristem. 2004. Long-term reinfection of the human genome by endogenous retroviruses. Proc. Natl. Acad. Sci. USA 101:4894-4899.
Pačes, J., A. Pavlíček, and V. Pačes. 2002. HERVd: database of human endogenous retroviruses. Nucleic Acids Res. 30:205-206.