Journal of Virology, June 2005, p. 6997-7004, Vol. 79, No. 11
0022-538X/05/$08.00+0 doi:10.1128/JVI.79.11.6997-7004.2005
Copyright © 2005, American Society for Microbiology. All Rights Reserved.
Departamento de Bioloxía Celular e Molecular, Universidade da Coruña, A Coruña,1 Hospital Clínico Universitario, Universidade de Santiago de Compostela, Coruña, Spain2
Received 5 October 2004/ Accepted 21 January 2005
Our genome is plagued with "fossil" remnants of transposable element (TE) mobilization periods that ceased long ago. As with the evolutionary history of organisms, some of whose clues can only be found in the fossil record, there are important questions about the evolution of TEs that can only be answered by looking at these genomic fossils. Not the least important of them is what stops the invasion of a genome by a TE family. The sequenced human genome, harboring thousands of copies from TE families that became "extinct" when they lost their capacity to proliferate, offers an exceptional opportunity to investigate this problem.
Notwithstanding the ultimate beneficial use of particular TE insertions by the host (27, 30) or the importance of TEs as generators of genetic variation in wild populations (23), these sequences generally behave as parasitic, selfish DNAs. Their potential to spread through host genomes and populations relies upon their ability to overreplicate the host DNA in the absence of any selective advantage to their carriers, within an evolutionary context that in many respects recalls an ecological community (7). Several mechanisms that may lead to dynamic equilibria of copy number (11), involving either self-regulation of TEs (40) or the opposing forces of transposition and host fitness effects of increased copy numbers (9), have been proposed and tested against observations. These equilibria may persist at some intermediate value for many generations of the host organism, but finally TEs are expected to be eliminated. Their proliferation may be reduced by their gradual accumulation of degenerating mutations (22) and/or by selection on the host genome to limit the damage caused by TEs (2, 17, 28, 43), leaving behind only the lucky insertions that proceeded to fixation, usually by random drift.
In this paper, we offer some hints about the mechanisms that led to the extinction of a long terminal repeat (LTR)-containing TE family that thrived in our not-so-distant past. LTR-containing TEs in the human genome are represented mainly by human endogenous retroviruses (HERVs). These fall into three classes, each comprising many families that originated independently from ancient infections of the germ line by different kinds of exogenous retroviruses (25, 41), which integrated into host chromosomes and then persisted as stable Mendelian factors for multiple generations. Their structure is accordingly quite similar to that of exogenous retroviruses, consisting of an internal sequence with homology to gag, pol, and sometimes env open reading frames, flanked by two LTRs.
HERVs may increase in copy number within the genome either via intracellular retrotransposition (within germ line cells) or through an extracellular infectious phase (reinfection of the germ line). The two pathways are not mutually exclusive. The recent finding of evolutionary constraint in the env gene of several HERV families (1) is consistent with reinfection having been their major means of proliferation. However, this was apparently not the case for the two largest families in terms of copy number (HERV-L and HERV-H). ERV9 is a class I family that was repeatedly mobilized during primate evolution (15, 18), bringing its copy number in the human haploid genome to approximately 120 members distributed on most chromosomes (36), as well as at least 4,000 solitary LTRs (26) produced by recombination between the 5' and 3' LTRs of the same insertion. A reconstruction of the evolutionary history of this family through a paleogenomic analysis of its LTRs (15) led to the identification of 14 subfamilies, integrated in sequential order in one of four main lineages, presumably corresponding to expansion waves from different master copies (5, 16). The age of these subfamilies was estimated so that they could be placed on the phylogenetic tree of primate evolution. The first of them probably appeared after the split of New World and Old World monkeys (38 MYA). Then, successive expansions took place, with several subfamilies simultaneously active over long periods of time, particularly in the interval from when gibbons began to diverge from the higher apes until after the split of gorillas (7 to 15 MYA, according to reference 20). Finally, this high proliferation ceased for unknown reasons, and no new subfamilies have been found since. We have now made a detailed reconstruction of the evolutionary history of the last subfamily of ERV9, named XII, and show that its activity actually continued until the separation of humans and chimpanzees, when it finally ceased, most likely not as a consequence of a slow, progressive degeneration of TE sequences but because of a relatively rapid spread of restrictive alleles in the host populations.
Insertions were grouped into different sets according to shared nucleotide variants. A nucleotide position was considered diagnostic of a sequence set whenever >70% of the sequences grouped into it shared the same nucleotide, which differed from that characterizing at least some other similar groups. Groups were made up of at least five sequences, sharing two or more correlated nucleotide variants. At least one of these nucleotide variants had to involve a site not diagnosed as a CpG doublet in the subfamily consensus. Occasionally, two subgroups could be established, each consisting again of at least five sequences but sharing just a single nucleotide variant at a non-CpG doublet. Consensus sequences for each set of ERV9_XII insertions were constructed following the same rules described above for the subfamily.
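The grouping rule described above can be illustrated with a minimal Python sketch. It assumes an alignment given as equal-length strings and computes nucleotide frequencies among non-gap characters only; the function names (`column_consensus`, `consensus`, `diagnostic_positions`) are hypothetical and are not taken from the original analysis, which also applied the additional group-size and CpG conditions that are not modeled here.

```python
from collections import Counter

def column_consensus(column, threshold=0.70):
    """Return the majority nucleotide of one alignment column if its
    frequency (among non-gap characters) exceeds threshold, else 'N'."""
    counts = Counter(base for base in column if base != "-")
    if not counts:
        return "N"
    base, n = counts.most_common(1)[0]
    return base if n / sum(counts.values()) > threshold else "N"

def consensus(seqs, threshold=0.70):
    """Column-wise consensus of a list of aligned, equal-length sequences."""
    return "".join(column_consensus(col, threshold) for col in zip(*seqs))

def diagnostic_positions(group, others, threshold=0.70):
    """Positions where the group's consensus nucleotide (shared by more than
    threshold of its sequences) differs from the consensus of at least one
    other group. Ambiguous ('N') positions are never counted as diagnostic."""
    own = consensus(group, threshold)
    other_cons = [consensus(o, threshold) for o in others]
    return [
        i for i, base in enumerate(own)
        if base != "N" and any(oc[i] not in ("N", base) for oc in other_cons)
    ]

# Toy alignments (invented data, for illustration only)
group_a = ["ACGTA", "ACGTA", "ACGTA", "ACGTT", "ACGTA"]
group_b = ["ACCTA"] * 5
print(consensus(group_a))
print(diagnostic_positions(group_a, [group_b]))
```

In this toy case the single private variant in `group_a` (the final T) does not affect the consensus, and only the third column separates the two groups.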
Phylogenetic reconstructions of the consensus sequences of the different sets of ERV9_XII insertions were carried out both by distance (neighbor joining [NJ] with Kimura's two-parameter model with a transition/transversion ratio of 2) and maximum parsimony (MP) methods implemented in the MEGA2.1 package (available at http://www.megasoftware.net). In MP analyses, we searched for the best trees using the close-neighbor interchange, with default parameter values and random addition of sequences to produce the initial trees.
For the following comparative analyses of the different sets of ERV9_XII insertions, it was first necessary to rid them of CpG dinucleotides, whose very high mutation rate (3) could introduce significant noise into our analyses. Exclusively for that purpose, a subfamily consensus was constructed with a more stringent condition for the diagnosis of CpG doublets: the CpG dinucleotide was chosen as the consensus unless the T or A nucleotides were present in >90% of the sequences, instead of the 70% threshold routinely applied to derive a consensus. All sites that were CpG under this more stringent condition were removed from the general alignment, as were all sites corresponding to nonconsensus nucleotide insertions.
To estimate the ages of the different sets of ERV9_XII insertions, we first calculated the average number of nucleotide substitutions from their consensus (K), using Kimura's two-parameter model with a transition/transversion ratio of 2. Assuming 0.16% per MYR as the rate of change of pseudogene sequences in primates (15), the average expansion age of each sequence set was estimated as T = K/0.0016.
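As a rough illustration of this dating step, the following Python sketch computes the standard Kimura two-parameter distance from the observed proportions of transitions (P) and transversions (Q), averages it over a set of sequences against their consensus, and converts the result to an age with T = K/0.0016. This is an assumption-laden sketch: it estimates P and Q from the data rather than fixing the transition/transversion ratio at 2 as in the MEGA analysis, and the function names are hypothetical.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def k2p_distance(seq1, seq2):
    """Standard Kimura two-parameter distance between two aligned sequences.
    Columns containing anything other than A, C, G, or T are skipped."""
    n = transitions = transversions = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue
        n += 1
        if a == b:
            continue
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            transitions += 1
        else:
            transversions += 1
    if n == 0:
        raise ValueError("no comparable sites")
    p = transitions / n
    q = transversions / n
    return -0.5 * math.log((1 - 2 * p - q) * math.sqrt(1 - 2 * q))

def mean_divergence_from_consensus(seqs, cons):
    """Average K2P distance (K) of each sequence from the group consensus."""
    return sum(k2p_distance(s, cons) for s in seqs) / len(seqs)

def age_myr(k, rate_per_myr=0.0016):
    """Expansion age in MYR, assuming 0.16% substitutions per site per MYR."""
    return k / rate_per_myr

# Toy example: one transition in ten sites (invented data)
d = k2p_distance("AAAAAAAAAA", "GAAAAAAAAA")
print(d, age_myr(d))
```

The ten-site toy comparison is far too short for a meaningful age; it only shows how K feeds into T.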
The strict master model (SMM) postulates that all the insertions of a given set were instantaneously produced by retrotransposition of the same master element. Assuming that the consensus is the best possible reconstruction of the sequence of that master, expected average pairwise divergence between sequences of the same set was derived by Jurka (21), as in the following equation:
[Equation 1: image not recovered]
Tentative estimates of the durations of expansion periods were obtained following Tachida (37), assuming a transient master copy model. According to equation 19 in Tachida (37) and replacing terms as in Jurka (21), it can be easily shown that
[Equation 2: image not recovered]
Phylogenetic reconstruction of the sequences of the whole set of ERV9_XII insertions was carried out by NJ, again using Kimura's two-parameter model of nucleotide substitution with a transition/transversion ratio of 2.
TABLE 1. Identified insertions of subfamily XII of ERV9 (ERV9_XII)a
TABLE 2. Analysis of divergence in the different insertion groups within ERV9_XII
To investigate the evolutionary history of the element during this period, we first made a phylogenetic analysis of the consensus sequences of the different groups, which constitute the best sequence reconstruction of the master elements that gave rise to them. Nucleotide variants of their alignment with the consensus sequences of subfamily ERV9_XI, used as an outgroup (15), subfamily XII, and each of the three hypothetical groups, are shown in Fig. 1. Many of these differences were shared by different groups, suggesting that they can be placed in a sequential order. The phylogenetic tree (Fig. 2) confirmed the presence and ordering of shared variants, showing what seemed at first to be a single, uninterrupted lineage that sequentially gave rise to groups A to D(h) but later split into two lineages, one leading to groups G to H and the other to groups I to M2. The positions of E and F(h) were dubious, since MP analysis places them in the G-H clade (see the supplemental material). The estimated ages of individual groups (Table 2) were not always in good agreement with their sequential order within each lineage. This finding was not at all unexpected, since each group, except for C and G, showed one or several private differences (Fig. 1 and 2) that indicate that they are not likely to be the direct ancestors of the groups that followed in their lineage. Evidently, there are several transitional stages that left no representatives in the fossil record of the human genome. Each internal node of the phylogenetic tree, except for the dichotomies leading to C and G, bore evidence of the coexistence of at least two active master elements. However, information on the evolution of the diversity of these masters could not be straightforwardly obtained from the phylogenetic tree, because the evolutionary rates of these active elements may have been rather different.
According to parsimony analyses, in the 4 MYR comprising the expansion period of the subfamily, a total of 17 nucleotide changes must have taken place to produce L (see the supplemental material), the master element of the most evolved group, or 0.006 changes per site per MYR (nearly four times higher than the evolutionary rate of pseudogene copies used for dating ERV9 insertions). Of these changes, 10 must have taken place after the splitting of the main lineage into those leading to L and G (the two most recently active groups of the subfamily). However, in this same period, only two changes led to the G master. This may be indicative of heterogeneity in the evolutionary rates of coexisting master elements. On the other hand, our estimates of the ages of the different groups are subject to considerable error (Table 2). This may help to explain some major discrepancies between the age of a group and its position on the phylogenetic tree. Group I, for example, should be relatively old, according to divergence among its representatives, but it occupies an intermediate position in the tree and contains two species-specific insertions. Taking all these factors into consideration, it may be concluded that ERV9_XII sequences experienced a remarkable increase in the diversity of their coexisting active copies in the last two MYR of their history of transposition.
FIG. 1. Nucleotide differences among the consensus sequences of the different groups of subfamily XII. The first three rows refer to base positions relative to Fig. 1 in Costas and Naveira (15; see also the supplemental material). Double- and single-underlined positions indicate sites forming part of CpG dinucleotides in the general consensus sequence of subfamily XII, with a 70% or a 90% threshold frequency, respectively (see Materials and Methods). The general consensus sequence of subfamily XI was used as a reference. Dots indicate identity with this reference sequence. Uppercase letters indicate nucleotides present in >70% of the sequences belonging to a group; lowercase letters indicate nucleotides present in 50% to 70% of the sequences in a group.
FIG. 2. Phylogenetic relationships within ERV9_XII, based on analyses of consensus sequences of the different groups established according to shared nucleotide differences. The displayed tree was obtained by the NJ method, and it is rooted with the general consensus of subfamily XI (ERV9_XI), used as an outgroup. Values indicate the percentages of equally parsimonious trees supporting internal branches (only values of >70% are indicated; three equally most parsimonious trees were obtained). Sequence groups that did not depart significantly from a star phylogeny, after contrasting observed and expected values of Jurka's coefficient, are marked with a star symbol.
FIG. 3. Phylogenetic reconstruction of the full set of ERV9_XII sequences (97 solitary LTR insertions) by the NJ method, after CpG positions were excluded. The tree is rooted with the consensus of ERV9_XI. Each sequence is designated by its GenBank accession number, followed by the name of the group it has been assigned to in this work.
Application of Tachida's transient master copy model to the three groups that depart significantly from star phylogenies (namely, A, C, and E) led to point estimates for the persistence of their expansion periods of 1.4, 0.76, and 1.1 MYR, respectively. By contrast, the seven groups that show a good fit to a star phylogeny (i.e., to an "instantaneous" expansion) would have actually expanded for only 0.1 MYR, on average.
This study was confined to roughly the last four MYR of existence of ERV9 as a TE, when it gave rise to subfamily ERV9_XII, 6 to 10 MYA, in the most recent common ancestor of humans and chimpanzees. Our examination of ERV9_XII sequences reveals that during the first half of this time interval, three major expansion waves of variants of a dominant lineage took place at different times. One of these expansions (group B) may have been instantaneous, meaning that the proliferation rate was probably much higher than the mutation rate per base pair between expansion periods. The other two, groups A and C, should have persisted for 69,000 and 38,000 generations, or 1.4 and 0.76 MYR (assuming a generation time of 20 years), respectively, which is fully congruent with our estimates of the ages of the different groups. Then, in the two MYR prior to its extinction, ERV9_XII appears to have been engaged in frenetic activity, which produced at least 75% of the insertions of this subfamily, distributed among eight groups and two lineages. All these groups except E, whose expansion is expected to have persisted for 56,000 generations, may have been produced by "instantaneous" expansion of single-sequence variants. Interestingly, according both to age estimates based on divergence within groups and, above all, to the presence of a few species-specific insertions, several of these groups were most likely simultaneously active during the first stages of speciation of the genera Homo and Pan. Remarkably, three species-specific insertions have been identified, representing the first reported fixed differences of this kind between humans and chimpanzees, apart from those belonging to the HERV-K family (29). They most probably correspond to insertion polymorphisms in the most recent common ancestor of these two species, which became fixed for alternative alleles after separation of their gene pools.
The human genome harbors nearly half a million copies of roughly 100 HERV families (25). All of these families, except one, are now apparently extinct, i.e., they can spread no further over the genome. The only exception is HERV-K, which has three human-specific subfamilies (8), some of whose insertions are polymorphic in modern human populations and thus may still be capable of movement (42). The ultimate cause of the extinction of a TE family will be the reduction of its proliferation rate below a certain threshold, which depends on the per-nucleotide mutation rate. Thus, in Drosophila, where transpositions are relatively frequent, a TE jumps on average once in 10⁴ to 10⁵ generations, and the mutation rate is 10⁻⁹ to 10⁻⁸ per bp per generation, so that a copy of a typical element (10⁴ bp) is expected to accumulate at least 1 mutation between jumps (31). This amount may not be considered too serious a risk for losing copy functionality, but if the transposition rate is further reduced or the mutation rate is increased, many TEs may certainly die before they have a chance to transpose. However, this is not likely to have been the case for ERV9. All the groups that make up subfamily XII appear to have been the result of independent expansions from single sequence variants, each within an elapsed time on the order of 10³ to 10⁴ generations, which certainly leaves very few opportunities for the gradual degeneration of the population of sequences.
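The Drosophila arithmetic in this paragraph can be checked with a short Python sketch. It reads the quoted figures as 10^4 to 10^5 generations between jumps, a mutation rate of 10^-9 to 10^-8 per bp per generation, and an element length of 10^4 bp, and simply multiplies them; the function name is hypothetical and the calculation ignores any variation in rates along the element.

```python
def expected_mutations_between_jumps(mutation_rate_per_bp, element_length_bp,
                                     generations_between_jumps):
    """Expected number of mutations a single copy accumulates between two
    transposition events: rate per bp per generation x length x generations."""
    return mutation_rate_per_bp * element_length_bp * generations_between_jumps

# Lower and upper ends of the ranges quoted in the text
low = expected_mutations_between_jumps(1e-9, 1e4, 1e4)   # approximately 0.1
high = expected_mutations_between_jumps(1e-8, 1e4, 1e5)  # approximately 10
print(low, high)
```

Under these assumptions the expectation spans roughly 0.1 to 10 mutations per copy between jumps, so the figure of about 1 mutation quoted in the text sits in the middle of that range on a logarithmic scale.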
Another possibility leading to extinction of a TE family is the fixation of restrictive factors in the host population. Host genomes have adopted several defense strategies against TEs and viruses, as part of their intracellular and extracellular conflicts for over a billion years of coevolution. One of the most useful and simple models for analyzing these relationships between TEs and the host genome is offered by Drosophila. In laboratory lines of D. melanogaster, different families of TEs are active in different lines, and transposition rates vary widely among families, with some of them transposing at very high rates and the rest remaining stable. Thus, unstable lines have been found for either gypsy or copia and have been shown to carry permissive alleles, which specifically release the host control on the copy number of the corresponding family; stable lines have been shown to carry alleles that restrict their transposition (see reference 31 and references therein). A repressive state specific for a given family may be established by homology-dependent trans-silencing mechanisms, produced by either transcriptional (inactivation of the promoter) or posttranscriptional (sequence-specific RNA degradation) molecular mechanisms. They were first described with transgenic plants but now appear to have a general role in genome defense against viruses and mobile elements in a broad range of normal organisms (10, 28, 35, 43). However, the best-characterized mechanisms for restricting proviral amplification in both exogenous and endogenous viruses involve different ways of preventing their binding to cell surface receptors, such as the Fv4 gene in mice (39), or hindering preintegration steps of retroviral replication, as in Fv1, Lv1, and Ref1 (2). 
One of the most remarkable aspects of these different kinds of control mechanisms is that the genes involved are frequently derived from specific TE or provirus copies, not necessarily from the same family that is under their control. Finally, in this succinct list of restriction factors, cytoplasmic RNA/DNA editing enzymes have been added to the intracellular repertoire of defenses in primate genomes, after recent studies of human cell line variation in susceptibility to HIV infection (34). Restrictive and permissive factors are likely to segregate in natural populations of all organisms, and their frequencies are probably the major determinants of the proliferation rates of the different TE families residing in the genome. Sometimes TEs escape from the control of the host and begin to expand in an explosive manner, bringing about a reduction in the relative fitness of the bearers of permissive alleles. Thus, the frequency of restrictive alleles in that population is expected to increase; if they eventually become fixed in the species, the corresponding TE family may be repressed within relatively few TE generations and so come to a "sudden" extinction just after a period of flourishing activity. This is precisely what seems to have happened with ERV9, which according to our data may have gone extinct in approximately 100,000 years (5,000 generations), after 32 MYR of residence as an active TE in the genome of our ancestors (15), interestingly just before the separation of the human and chimpanzee lineages. It would be very useful to know whether the same pattern applies to the many other families of extinct HERVs harbored by our genome. We are just beginning to understand the genetic basis of trans-silencing mechanisms, and it will probably take a long time to assess the relative strength of the evolutionary forces acting on their variation in natural populations. Until then, we may only guess by examining their putative effects on the populations of TE sequences.
Supplemental material for this article may be found at http://jvi.asm.org/.