Journal of Virology, February 2008, p. 1819-1826, Vol. 82, No. 4
0022-538X/08/$08.00+0 doi:10.1128/JVI.01926-07
Copyright © 2008, American Society for Microbiology. All Rights Reserved.

School of Biological Sciences, The University of Hong Kong, Hong Kong, China,1 State Key Laboratory of Virology, Wuhan Institute of Virology, Chinese Academy of Sciences, Wuhan, Hubei, China,2 Bioinformatics Institute, University of Auckland, Auckland, New Zealand3
Received 3 September 2007/ Accepted 21 November 2007
Genetic analysis revealed a considerable diversity among Bt-SLCoV genomes, suggesting the presence of a wide spectrum of genetically diverse Bt-SLCoVs in various bat species (43). In addition, previous studies indicated a high seroprevalence against Bt-SLCoVs among various bat populations (29, 31). Therefore, bats were proposed to be the natural reservoir of the lineage of SLCoV and SCoV. Nonetheless, based on the relatively distant phylogenetic relationship between Hu-SCoVs and Bt-SLCoVs, researchers suggested that none of the currently sampled Bt-SLCoVs is the descendant of the direct ancestor of Hu-SCoVs (51). Therefore, the direct ancestor of Hu-SCoVs, as well as its corresponding host species, remains elusive.
In this study, we reanalyzed the available Bt-SLCoV genomes and identified a possible recombination event within the genome of a Bt-SLCoV. Phylogenetic analysis of its parental regions suggests the presence of an uncharacterized SLCoV lineage that is phylogenetically closer to Hu-SCoVs than any of the currently sampled Bt-SLCoVs and is therefore a candidate for the direct ancestor of the Hu-SCoV lineage.
To investigate the time of divergence between Hu-SCoVs and this SLCoV lineage, we analyzed the SCoV and SLCoV genome data under both strict- and relaxed-molecular-clock models. Previous studies demonstrated that the rate variations among lineages can mislead estimation of the divergence date if a strict clock is assumed (54). In contrast, if the data set is clocklike, assumption of a molecular clock increases the precision of rate estimates without compromising accuracy (14). The choice of a molecular-clock model is thus crucial for accurate molecular dating. Therefore, we analyzed our data sets under various Bayesian molecular-clock models, aiming to place a robust time scale on the interspecies transmission of Bt-SLCoVs and to provide insights into the zoonotic origin of Hu-SCoVs.
Estimation of the potential recombination breakpoint location. The data set was further analyzed using single-breakpoint estimation algorithms implemented in Genetic Algorithms for Recombination Detection (GARD) and Likelihood Analysis of Recombination in DNA (LARD). Based on the bootscan analysis, only the 2,000 nucleotides (nt) around the open reading frame 1b (ORF1b)/S junction (nt 20150 to 22202; all nucleotide numberings in this study are based on AY274119) were analyzed in order to increase the precision of recombination breakpoint estimation. Based on the RDP results, three selected taxa, Rp3, Tor2, and Rm1, were used in the analyses described below. Briefly, GARD uses a genetic algorithm to search for the best breakpoint locations (23). LARD uses a maximum likelihood (ML) method and a likelihood ratio test (LRT) to assess the significance of the inferred breakpoint (15). To demonstrate that the detected recombination event is not likely to be a result of random chance (15), the likelihood ratio (LR) of our data set was evaluated against the null distribution of LRs of 1,000 data sets simulated under the assumption of no recombination using Seq-Gen (42).
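The comparison of an observed LR against a simulated null distribution can be sketched as an empirical P value. This is a minimal illustration and not the procedure implemented in LARD; the function name, the toy numbers, and the "+1" correction (which keeps the estimate above zero) are assumptions introduced here.

```python
def empirical_p_value(observed_lr, null_lrs):
    """Share of null-simulation LRs at least as large as the observed LR.

    The +1 in numerator and denominator is an assumed convention that
    counts the observed data set as one draw from the null.
    """
    exceed = sum(1 for lr in null_lrs if lr >= observed_lr)
    return (exceed + 1) / (len(null_lrs) + 1)


if __name__ == "__main__":
    # Hypothetical values: 1,000 null LRs, one of which exceeds the observed LR.
    null_lrs = [1.0] * 999 + [12.0]
    print(empirical_p_value(10.0, null_lrs))
```

With 1,000 simulated data sets, the smallest P value this estimator can report is 1/1,001.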
Investigation of the phylogenetic origin of the potential parents. The genome regions 5' upstream and 3' downstream of the estimated breakpoint were designated major and minor parental regions, respectively. To investigate the phylogenetic origins of these potential parents, coding sequences of essential ORFs of the major (i.e., ORF1) and minor (i.e., S, E, M, and N genes) parental regions of selected CoV strains (n = 13) were aligned independently using ClustalX based on their codon sequences. The aligned ORFs of the two parental regions were degapped and concatenated separately, generating two alignments of 20,085 bp and 5,778 bp for the major and minor parental regions, respectively. For each of the parental regions, phylogenies were constructed using the Bayesian Markov chain Monte Carlo (BMCMC) method. The BMCMC analyses summarized the majority consensus trees produced by two sets of four tempered MCMC chains of 10^7 states sampled every 1,000th generation, with the initial 10% of states discarded. The Bayesian phylogenetic analysis was performed with MRBAYES 3 (44) under the best-fit substitution model determined by MRMODELTEST 2 (http://people.scs.fsu.edu/~nylander/). According to the BMCMC phylogeny (see Fig. 2A), the major parental lineage of Rp3 is designated the human-bat SLCoV (HB-SLCoV) lineage based on its close phylogenetic relationship with the Hu-SCoV lineage.
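The degapping and concatenation step can be illustrated as follows. This sketch assumes that "degapped" means removing every alignment column in which any sequence carries a gap; the function names and the toy sequences are hypothetical, and the actual tool used for this step is not stated in the text.

```python
def degap(alignment):
    """Remove every column that contains a gap ('-') in any sequence.

    `alignment` maps taxon name -> aligned sequence; all sequences must
    have the same length.
    """
    seqs = list(alignment.values())
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("sequences in an alignment must have equal length")
    keep = [i for i in range(length) if all(s[i] != "-" for s in seqs)]
    return {name: "".join(seq[i] for i in keep) for name, seq in alignment.items()}


def concatenate(alignments):
    """Join several alignments end to end, taxon by taxon."""
    names = list(alignments[0])
    return {name: "".join(a[name] for a in alignments) for name in names}


if __name__ == "__main__":
    # Toy alignments standing in for two ORFs of one parental region.
    orf_a = {"Rp3": "ATG-CA", "Tor2": "ATGACA"}
    orf_b = {"Rp3": "GGTTAA", "Tor2": "GG-TAA"}
    print(concatenate([degap(orf_a), degap(orf_b)]))
```

For codon-based alignments, gaps normally span whole codons, so column-wise removal preserves the reading frame.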
FIG. 2. Phylogenetic origins of the major and minor parental regions of Rp3. ML phylogenies were constructed from the concatenated sequences of the essential ORFs of the major (A) and minor (B) parental regions of selected CoVs. For the purposes of display, the phylogenies were midpoint rooted. The taxa were annotated according to their accession numbers and host species—civets (C), humans (H), or bats (B)—and strain names. The numbers on the left of the nodes refer to the BMCMC posterior probabilities. The percentages of support for all other internal nodes within the two lineages were omitted for simplicity. The recombinant strain Rp3, the most recent common ancestor of Hu-SCoVs (MRCA-Hu), and the divergence event between Hu-SCoVs and HB-SLCoVs (DIV-Hu/HB) are indicated. The scale bars are in units of nucleotide substitutions per site.
First, the strict molecular clock (i.e., a constant rate of evolution) of the two data sets was evaluated in an ML framework using PAML 3.15 as previously described (16, 41, 53). Briefly, the performances of the single-rate dated-tip (SRDT) (i.e., strict-clock) and the different-rate (DR) (i.e., no-clock) models in the data sets were compared using an LRT. Second, the two data sets were analyzed under the strict-clock model (CLOC), as well as the uncorrelated exponentially and lognormally distributed relaxed-clock models (UCED and UCLN) in a Bayesian framework. The CLOC model assumed a constant rate of evolution throughout the tree. The UCED and UCLN models assumed independent rates on different branches, which were drawn from an underlying exponential and lognormal distribution, respectively (6). These clock models are implemented in BEAST 1.4 (8). The MCMC chains were run for 5 x 10^6 (S1 data set) or 1 x 10^8 (ORF1 data set) states sampled every 1,000 generations with the initial 10% of burn-in samples discarded (7). For both data sets, the best-fit substitution model was the general time-reversible (GTR) model allowing four categories of gamma-distributed rate heterogeneity and a proportion of invariant sites (GTR + Γ4 + I), as determined by MODELTEST. Since the past population dynamics of the data sets were not the primary interest of our study, we assumed a constant coalescent tree prior for all analyses, with a Jeffreys prior on the constant population size hyperparameter (7). To investigate if this tree prior biased our date estimation, we also analyzed our data sets using a Yule tree prior, which assumes a constant speciation rate per lineage (6). All MCMC chains were independently run twice for the same analysis.
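The difference between the three clock models can be sketched by drawing per-branch rates. This is a simplified illustration only: it draws rates independently from continuous distributions, which may differ from how BEAST handles branch rates internally, and the function name, the `sigma` parameter, and the example mean rate are assumptions made for this sketch.

```python
import math
import random


def draw_branch_rates(n_branches, mean_rate, model, sigma=0.5, rng=None):
    """Draw one substitution rate per branch under a given clock model.

    CLOC: every branch shares `mean_rate`.
    UCED: independent exponential draws with mean `mean_rate`.
    UCLN: independent lognormal draws; `sigma` (assumed) is the standard
          deviation in log space, and mu is set so the mean is `mean_rate`.
    """
    rng = rng or random.Random()
    if model == "CLOC":
        return [mean_rate] * n_branches
    if model == "UCED":
        return [rng.expovariate(1.0 / mean_rate) for _ in range(n_branches)]
    if model == "UCLN":
        mu = math.log(mean_rate) - sigma ** 2 / 2.0
        return [rng.lognormvariate(mu, sigma) for _ in range(n_branches)]
    raise ValueError("model must be 'CLOC', 'UCED', or 'UCLN'")


if __name__ == "__main__":
    rng = random.Random(1)
    for model in ("CLOC", "UCED", "UCLN"):
        print(model, draw_branch_rates(5, 2.0e-3, model, rng=rng))
```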
To use information from the S1 data set to improve our estimate of tDIV-Hu/HB from the ORF1 data set, an S1-derived prior distribution was specified on tMRCA-Hu, which is a divergence event shared by the phylogenies of both data sets. This prior distribution was based on the posterior distribution of tMRCA-Hu estimated from the S1 data set under the best-fit clock model. The form and parameters of this distribution were estimated using distribution-fitting software, EasyFit 3.2 (MathWave Technologies). The MCMC chains for the ORF1 data set were rerun under the same configurations described above, except that an S1-derived prior was specified on tMRCA-Hu. For all Bayesian analyses, the median and the 95% highest posterior density (HPD) region of each parameter were summarized from two identical but independent MCMC chains using TRACER 1.3 (http://beast.bio.ed.ac.uk/). The adequacy of sampling was assessed via effective sample size, which was larger than 200 for all summary statistics investigated (all xml files for BEAST are available as supplementary material at http://evolution.hku.hk/SARS_dating.htm).
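An HPD region can be approximated from MCMC samples as the narrowest interval that contains the requested share of the samples. The sketch below shows that idea for a unimodal posterior; it is not TRACER's code, and the function name and the toy samples are assumptions.

```python
import math


def hpd_interval(samples, mass=0.95):
    """Narrowest interval containing at least `mass` of the samples.

    Assumes a unimodal posterior, so that the HPD region is one interval.
    """
    xs = sorted(samples)
    n = len(xs)
    # Number of samples the interval must cover (guarded against float error).
    k = min(n, max(1, math.ceil(mass * n - 1e-9)))
    # Slide a window of k consecutive sorted samples and keep the narrowest.
    best = min(range(n - k + 1), key=lambda i: xs[i + k - 1] - xs[i])
    return xs[best], xs[best + k - 1]


if __name__ == "__main__":
    # Hypothetical skewed sample: the 80% HPD excludes both extreme values.
    samples = [0.0, 1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 10.0]
    print(hpd_interval(samples, mass=0.8))
```

Unlike an equal-tailed interval, this interval shifts toward the denser side of a skewed posterior.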
Comparison of the performance characteristics of Bayesian clock models. To compare the performance of any two Bayesian clock models for the same data set, the Bayes factor (BF) was calculated. The BF is the ratio of the marginal likelihoods of the two models. A simple method described by Newton and coworkers (39) computes the BF via importance sampling. A BF of >20, or a ln BF of >2.99, is defined as strong support for the favored model. Clock models of the same data set were compared two by two, and estimates of the best-fit model were taken as the final results.
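The BF comparison can be sketched with a harmonic-mean estimate of each marginal likelihood computed from posterior samples of the log likelihood. Treating the method of reference 39 as the plain harmonic-mean estimator is an assumption of this sketch (the software used in the study may apply a smoothed or bootstrapped variant), and the function names and toy values are hypothetical.

```python
import math


def log_marginal_likelihood_hme(log_likelihoods):
    """Harmonic-mean estimate of ln(marginal likelihood).

    ln m = ln(n) - logsumexp(-lnL_i), evaluated stably in log space.
    """
    neg = [-ll for ll in log_likelihoods]
    top = max(neg)
    lse = top + math.log(sum(math.exp(v - top) for v in neg))
    return math.log(len(neg)) - lse


def ln_bayes_factor(log_liks_a, log_liks_b):
    """ln BF of model A over model B."""
    return log_marginal_likelihood_hme(log_liks_a) - log_marginal_likelihood_hme(log_liks_b)


# BF > 20 corresponds to ln BF > ln(20), about 2.996 (quoted as 2.99 in the text).
STRONG_SUPPORT_LN_BF = math.log(20)

if __name__ == "__main__":
    # Toy posterior log-likelihood samples for two hypothetical clock models.
    model_a = [-100.0, -100.5, -99.8, -100.2]
    model_b = [-104.0, -104.3, -103.9, -104.1]
    ln_bf = ln_bayes_factor(model_a, model_b)
    print(ln_bf, ln_bf > STRONG_SUPPORT_LN_BF)
```

The harmonic-mean estimator is known to be unstable because it is dominated by the lowest sampled likelihoods, so values from a sketch like this should be read as rough guides.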
FIG. 1. Detection of recombination and estimation of a breakpoint within the genome of Rp3. A similarity plot (A) and a bootscan analysis (B) detected a single recombination breakpoint at around the ORF1b/S junction. Both analyses were performed with an F84 distance model, a window size of 1,500 bp, and a step size of 300 bp. The Hu-SCoV group includes strains Tor2 (AY274119), GD01 (AY278489), ZJ01 (AY297028), SZ3 (AY304486), GZ0402 (AY613947), and PC4 (AY613950). (C) Organization of essential ORFs of the SCoV genome and location of the estimated breakpoint. The blue and red horizontal arrows represent the essential ORFs from the major and minor parents, respectively. A sequence alignment of the ORF1b/S junction regions of Rp3, Tor2, and Rm1 is shown below. A consensus IGS and the coding regions of ORF1b and S are annotated above the alignment. The black vertical arrow below the alignment indicates the estimated breakpoint located immediately after the start codon of the S coding region.
Genomes of CoVs are reported to have relatively high recombination rates (28). For example, experimental recombination between temperature-sensitive mutants and wild-type strains of mouse hepatitis virus has been studied extensively (21, 27, 34). Moreover, evidence of recombination has also been reported in field isolates of infectious bronchitis virus (19, 25, 30) and feline CoV (13). The occurrence of a high frequency of homologous RNA recombination in CoV genomes is probably related to the unique discontinuous transcription mechanism of its mRNA, in which the nascent RNA transcripts must dissociate from the template and fuse with the leader RNA to a distant mRNA start site (28). Regular dissociation and rejoining of the complex of polymerase and nascent RNA during transcription are similar to the template-switching mechanism in the "copy choice" model of recombination in RNA viruses (26). In fact, one of the most utilized recombination sites within the mouse hepatitis virus genome is at the junction between the leader RNA and the remainder of its genome (22). In addition, a previous report suggested that the consensus intergenic sequences (IGS) and the highly conserved sequences around this region may serve as recombination "hot spots" in infectious bronchitis virus (25). In this study, we identified a potential recombination site immediately after the consensus IGS (17), suggesting that the replication intermediates may participate in the recombination event, as speculated previously in other CoVs. Previous studies suggested that the relatively high rates of recombination and mutation may facilitate the cross-species transmission of CoVs (2, 3), and therefore, CoVs were speculated to be potentially important emerging pathogens (1). A wider surveillance of Bt-SLCoVs may shed light on the possible roles of this observed recombination event in the emergence of SARS.
Phylogenetic origin of the putative parental strains. To investigate the phylogenetic origin of the putative parents, two BMCMC phylogenies were constructed based on the major (Fig. 2A) and minor (Fig. 2B) parental regions, respectively. The minor parental region of Rp3 was clustered within the Bt-SLCoV lineage and shared monophyly with Rm1 and BtCoV/279/2005 (Fig. 2B). This suggests that the potential minor parent of Rp3 is probably a Bt-SLCoV that shared a close phylogenetic relationship with Rm1 and BtCoV/279/2005. It has been suggested that there is species-specific host restriction of CoVs in bats, since most CoVs from a single bat species grouped together in phylogenetic analyses (48). Moreover, the S protein (which is located within the minor parental region) is the primary determinant of species specificity in CoVs (12, 36), and thus, we speculate that this minor parent may be a Bt-SLCoV residing in Rhinolophus pearsoni, i.e., the host species of Rp3.
On the other hand, the major parental region of Rp3 grouped with, but clustered outside of, the Hu-SCoV lineage (Fig. 2A). Based on this observation, the potential major parent of Rp3 is possibly derived from an uncharacterized lineage that is phylogenetically closely related to Hu-SCoVs. The host species of this speculative parental lineage cannot be ascertained, as it was clustered within neither the Hu-SCoV nor the Bt-SLCoV lineage. Here, we outline three possibilities regarding the host species of this lineage. First, the lineage may originate from an unsampled group of phylogenetically distinct SCoVs residing in live-animal market mammals, like civets or raccoon dogs. However, extensive surveillance of various mammalian species over a wide range of geographic locations has been performed, and only CoVs that are highly similar to SCoVs in humans were sampled (20). Thus, this possibility seems unlikely. Second, the lineage may originate from an unknown nonbat intermediate host species, which possibly acquired a SLCoV from bats and transmitted the virus to an amplifying host, such as civets, resulting in spillover in live-animal markets in southern China. However, one of the prerequisites for recombination is coinfection of parental strains within an individual. Therefore, recombination of parental strains residing in different species, i.e., bats and the unknown intermediate host in this case, may be rare due to the relatively strict tropism barrier of CoVs (12, 52). Third, the strain may originate from an unsampled SLCoV lineage residing in a bat species that is phylogenetically closer to Hu-SCoVs than all other currently sampled Bt-SLCoVs. Based on the relatively high genetic diversity among the currently sampled Bt-SLCoVs, the existence of an unsampled phylogenetically distinct lineage of Bt-SLCoV is highly likely, and therefore, the third hypothesis seems to be the most plausible.
In the discussions below, this parental lineage is therefore referred to as the HB-SLCoV lineage, while the term "Bt-SLCoV lineage" refers to all other sampled Bt-SLCoVs (Fig. 2). This lineage is proposed to contain the major parent of Rp3 and other closely related strains, and we cannot exclude the possibility that the lineage may also contain the direct ancestor of Hu-SCoVs. To further investigate the time of this interspecies transmission event, tMRCA-Hu and tDIV-Hu/HB (Fig. 2) were estimated under various molecular-clock models in both ML and Bayesian frameworks.
Molecular clock-like behavior of the data sets and choice of Bayesian clock models. For the ORF1 data set, under the ML framework, LRT analysis suggests that the SRDT model should be rejected in favor of the DR model (Table 1). Moreover, BF analysis suggests that the UCED model fits the ORF1 data set significantly better than the other two models (Table 2), implying that the rate variations among branches of the ORF1 phylogeny are significant and that a strict clock cannot be assumed. The Bt-SLCoV lineage may contribute to the rate variations in the ORF1 data set, since CoVs of different hosts (i.e., bats and humans or civets) may have different substitution rates.
TABLE 1. Details of the two data sets and results of the ML molecular-clock tests
TABLE 2. Performances of the Bayesian clock models
FIG. 3. tMRCA-Hu estimated from the S1 data set. (A) tMRCA-Hu estimated from the S1 data set under various Bayesian clock models and the ML SRDT model. (B) Posterior MCMC samples (left y axis) of tMRCA-Hu estimated from the S1 data set under the UCED model and the lognormal distribution (right y axis) fitted using EasyFit. The values of the parameters for the lognormal distribution are as follows: σ = 0.56, µ = –1.00, and γ = 2.04.
tDIV-Hu/HB. A prior was specified on tMRCA-Hu as a lognormal distribution with parameters chosen to fit the posterior distribution estimated from the S1 data set (Fig. 3B). Bayesian inference specifically provides for the incorporation of prior knowledge, and in this way, we were able to combine information from both data sets in the estimation of tDIV-Hu/HB. Under the UCED model, the medians of tMRCA-Hu estimated from the ORF1 data set with or without the S1-derived tMRCA-Hu prior were similar, and the posterior distribution of tMRCA-Hu was not solely dependent on its prior distribution (Fig. 4A), suggesting that the ORF1 data set was providing additional information in the Bayesian inference. Moreover, tDIV-Hu/HB was consistently estimated at a median around the late 1990s with or without the S1-derived tMRCA-Hu prior (Fig. 4B). It was noted that the specification of the S1-derived tMRCA-Hu prior narrowed the HPD of the tDIV-Hu/HB estimate by about 40%, i.e., from 12.8 to 7.7 years (Table 3). Similar results were observed under the UCLN model (the data are not shown for simplicity).
FIG. 4. Specification of an S1-derived lognormal tMRCA-Hu prior in the analysis of the ORF1 data set under the UCED model. (A) Prior and posterior distributions of tMRCA-Hu. (B) Effects of the tMRCA-Hu prior on the posterior distribution of tDIV-Hu/HB.
TABLE 3. Estimates from the ORF1 data set under the Bayesian UCED model
Assuming there was an interspecies transmission of HB-SLCoVs from bats to an amplifying host (e.g., civets), the upper and lower bounds of this event should be theoretically represented by tDIV-Hu/HB and tMRCA-Hu, respectively (Fig. 5). Therefore, the time period between these two events can be considered the most conservative estimation of the period between the cross-species event and the onset of the epidemic. The median and HPD of this period were summarized by sampling the length of a particular branch (i.e., branch A in Fig. 5) of all time-scaled MCMC phylogenies under the UCED model. This period was estimated at a median of 4.08 years (HPD, 1.45 to 8.84 years) (Table 3). The estimated mean substitution rate of the ORF1 data set under the UCED model was 2.79 x 10^-3 (HPD, 1.64 x 10^-3 to 4.35 x 10^-3) substitutions per site per year. This estimate is comparable to a previous estimation for the whole genome of Hu-SCoV (i.e., 0.80 x 10^-3 to 2.38 x 10^-3) (55) and is of the same order of magnitude as those of other RNA viruses (4, 9, 18, 37, 38, 50). In addition, the ORF1 data set was reanalyzed under the UCED model with a Yule tree prior assumption, and the estimate is generally consistent with the estimate under the constant coalescent tree prior assumption, suggesting our date estimation is robust to the choice of tree priors.
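Summarizing the length of a branch across all sampled phylogenies amounts to taking the difference between the two node dates within each MCMC sample and then summarizing those differences, which is not the same as subtracting the two medians. A minimal sketch follows; the function name and the sample dates are hypothetical and are not values from this study.

```python
import statistics


def branch_durations(t_div_samples, t_mrca_samples):
    """Per-sample time (years) between the divergence node and the MRCA node.

    Both inputs are calendar dates taken from the same MCMC samples, in the
    same order, so each pair comes from one time-scaled phylogeny.
    """
    if len(t_div_samples) != len(t_mrca_samples):
        raise ValueError("both lists must come from the same MCMC samples")
    return [mrca - div for div, mrca in zip(t_div_samples, t_mrca_samples)]


if __name__ == "__main__":
    # Hypothetical paired node dates from four sampled phylogenies.
    t_div = [1998.0, 1997.5, 1996.0, 1999.0]
    t_mrca = [2002.5, 2002.0, 2002.5, 2002.0]
    durations = branch_durations(t_div, t_mrca)
    print(durations, statistics.median(durations))
```

Pairing the dates within each sample preserves the correlation between the two nodes, which a comparison of separately summarized medians would ignore.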
FIG. 5. Estimation of the window period between the cross-species event and the onset of the 2003 SARS epidemic. This time-scaled phylogeny was summarized from all MCMC phylogenies of the ORF1 data set analyzed under the UCED model with the S1-derived tMRCA-Hu prior. The heights of the nodes are represented by the median of their estimates. The HPD of tMRCA-Hu and tDIV-Hu/HB are indicated by gray boxes at these nodes. The taxa were labeled in the same style as in Fig. 2, except their sampling dates were annotated.
Based on the S protein sequences of the currently sampled Bt-SLCoVs, Li and coworkers (32) pointed out that substantial genetic changes in the S protein are likely to be necessary for the virus to infect humans. Because the S protein sequence of the direct ancestor of Hu-SCoV is currently unavailable, the genetic factors (e.g., residues under positive selection) that contributed to the switch of species tropism from the bat to the amplifying hosts cannot be determined. We expect that further characterization of the S sequences of the strains of the HB-SLCoV lineage should provide important information regarding the changes that may contribute to cross-species adaptation of the virus.
The observed genetic diversity among currently sampled Bt-SLCoVs strongly suggests bats, in particular, the genus Rhinolophus, are the natural reservoir of SLCoVs and SCoVs. However, among the 69 species of the genus Rhinolophus, the specific species that harbors the direct ancestor of Hu-SCoVs is still unknown (51). One possibility is that there were two phylogenetically distinct lineages of Bt-SLCoV residing in the bat species R. pearsoni that underwent recombination, giving rise to the recombinant strain Rp3. Thus, we suggest a more focused surveillance of SLCoVs in R. pearsoni, which may provide insights into the prevalence and diversity of this recombinant genotype, as well as the possible direct ancestor of Hu-SCoVs.
Another interesting outcome of our analysis is the very young age of the common ancestor of SLCoVs in bats (i.e., the root of the phylogeny in Fig. 5; median, 1982.81; HPD, 1965.75 to 1995.83). It is noted that this estimate refers only to the tMRCA of all currently sampled Bt-SLCoVs, and characterization of more diverged Bt-SLCoVs should extend the age of the lineage. Nonetheless, this estimate precludes codivergence of Bt-SLCoVs with their host bat species. More importantly, it suggests that cross-species transmission of these viruses between different bat species is very common and occurs on an ongoing basis. Interspecies transmissions of CoVs among wildlife and livestock species are well documented (46). With SARS as an example, more comprehensive surveillance of pathogens in wildlife species should make an important contribution to the detection and control of emerging zoonotic infections (24).
We thank Susanna K. P. Lau of the Department of Microbiology, Faculty of Medicine, University of Hong Kong, for her valuable comments on the manuscript.
Published ahead of print on 5 December 2007.