Journal of Virology, May 2009, p. 4690-4694, Vol. 83, No. 9
0022-538X/09/$08.00+0 doi:10.1128/JVI.02358-08
Copyright © 2009, American Society for Microbiology. All Rights Reserved.
Relaxed Selection and the Evolution of RNA Virus Mucin-Like Pathogenicity Factors
Joel O. Wertheim* and
Michael Worobey
Department of Ecology and Evolutionary Biology, University of Arizona, Tucson, Arizona
Received 12 November 2008/Accepted 7 February 2009

ABSTRACT
Mucin-like regions contribute to pathogenicity in a variety
of negative-stranded RNA viruses. These regions are characterized
by a preponderance of O-linked glycosylation. They evolve exceptionally
rapidly yet maintain their function as pathogenicity factors.
Two hypotheses have been proposed to explain this evolutionary
conundrum of phenotypic stability in the face of extreme genetic
divergence: strong positive selection and relaxation of purifying
selection. We determined the strength and direction of selection
codon by codon across genes containing these regions and found
that purifying selection is relaxed over the mucin-like regions
relative to the genes in which they are found. This suggests
that so long as these regions maintain sufficient O-linked glycosylation,
they are free to evolve rapidly without loss of function as
pathogenicity factors.

TEXT
Viral mucin-like regions (MLRs) within glycoprotein genes act
as pathogenicity factors in negative-stranded RNA viruses of
humans. MLRs have been implicated as pathogenicity factors in
Ebola virus (EBOV), Marburg virus (MARV), Crimean-Congo hemorrhagic
fever virus (CCHFV), and human respiratory syncytial virus (hRSV)
(13, 16, 18, 24). An MLR has also been identified in the glycoprotein
of human metapneumovirus (hMPV) (8), but its importance in pathogenicity
has not been determined.
MLRs, like eukaryotic mucins, are characterized by a high prevalence of serine, threonine, and proline, which bind O-linked glycans and induce β-turns that trap water and salts (3). MLRs experience rapid nucleotide substitution compared to other coding regions of RNA viruses (1, 11, 12). In fact, among the four EBOV subtypes, the MLRs have accumulated so many substitutions that the level of dissimilarity between them approaches that of randomly generated sequences (7). Nevertheless, purifying selection appears to act on this region, as alignment of the EBOV MLRs showed that their lengths do not differ among subtypes and that O-glycosylation is maintained among all four subtypes (7). It is probable that these MLRs also serve a yet-undetermined function in the disease-free natural hosts of many of these viruses. Despite evidence of the importance of MLRs in disease progression in humans, little is known about their evolutionary history.
Two non-mutually exclusive hypotheses have been proposed to explain this rapid sequence divergence within the MLR: positive selection (10, 12) and relaxation of purifying selection (2, 10). Positive selection results in an increased rate of adaptive amino acid substitutions; relaxation of purifying selection would also increase the rate of amino acid substitution, but this would be due to the lessening of selective constraints. The first hypothesis is supported by sporadic evidence of positive selection within MLRs (12, 14, 23, 25). In contrast, work by Sanchez et al. showed that the number of nonsynonymous mutations is higher across the entire MLR, indicating a possible relaxation of selection interspersed with some positive selection (12). Here, we explicitly test these hypotheses by comparing the selective pressures within the MLR to selection for the rest of the gene.
To better appreciate the dynamics that govern MLR evolution, we first conducted a detailed analysis of the patterns of divergence and base composition along genome alignments of each virus. Our goal here was to see whether we could identify patterns diagnostic of MLRs that could shed light on how they evolve. We analyzed nonoverlapping windows along complete genome alignments for each virus with known glycoprotein MLRs. Alignments were constructed in Se-Al (http://tree.bio.ed.ac.uk), using all available complete genomes from GenBank (www.ncbi.nlm.nih.gov) for each of the five viruses: EBOV, MARV, CCHFV, hRSV, and hMPV (Table 1). Identical and laboratory-derived mutant strains were removed. Protein coding regions were aligned manually; intergenic and ambiguous coding regions were aligned using Clustal_X (19). Alignments were partitioned into consecutive nonoverlapping 50-nucleotide windows. The mean uncorrected pairwise distances and maximum likelihood base frequencies for the regions were calculated using PAUP* version 4.1 (17). Uncorrected distances were used, because many sequences contained regions exhibiting divergence levels approaching saturation. Maximum likelihood models were determined using Akaike information criteria in ModelTest version 3.7 (9).
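As an illustration only, the windowing step described above can be sketched in Python. This is a minimal sketch under stated assumptions, not the PAUP* procedure that was actually used: the function names are hypothetical, the alignment is assumed to be a list of equal-length strings, and sites with gaps or ambiguity codes are simply skipped in each pairwise comparison.

```python
from itertools import combinations

WINDOW = 50  # window length in nucleotides, as stated in the text


def p_distance(a, b):
    """Uncorrected distance: proportion of compared sites that differ.

    Sites where either sequence has a gap or ambiguity code are skipped
    (an assumption of this sketch). Returns None if no site is comparable.
    """
    valid = "ACGTU"
    compared = 0
    differing = 0
    for x, y in zip(a.upper(), b.upper()):
        if x in valid and y in valid:
            compared += 1
            if x != y:
                differing += 1
    return differing / compared if compared else None


def windowed_mean_distance(alignment, window=WINDOW):
    """Mean pairwise p-distance for each consecutive nonoverlapping window.

    The final window may be shorter than `window` if the alignment length
    is not an exact multiple of it.
    """
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("sequences must be aligned to equal length")
    means = []
    for start in range(0, length, window):
        dists = []
        for a, b in combinations(alignment, 2):
            d = p_distance(a[start:start + window], b[start:start + window])
            if d is not None:
                dists.append(d)
        means.append(sum(dists) / len(dists) if dists else None)
    return means
```

Called on a toy alignment such as `windowed_mean_distance(["ACGT", "ACGA", "AGGT"], window=2)`, the function returns one mean distance per two-nucleotide window; with real genome alignments the default 50-nucleotide window would apply.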
We observed that for each virus, the glycoprotein MLR contained
the region with the highest ratio of cytosine to uracil (C/U)
in the genome. This pattern makes sense mechanistically, since
the amino acids thought to be involved in the formation of MLRs
(serine, threonine, and proline) share a cytosine in the second
codon position. Moreover, each glycoprotein MLR was characterized
by a simultaneous increase in uncorrected pairwise distance
along with the increase in the C/U ratio (Fig. 1). C/U data
were normalized using log10 transformation. To delineate the
putative boundaries of these MLRs for use in further analyses
(described below), we calculated the means and standard deviations
of log10 C/U ratios for all 50-nucleotide windows across the
genomes. Regions with values at least 1.96 standard deviations
above the mean were classified as high C/U regions. If a series
of high C/U regions occurred within a coding region and was
accompanied by an increased mean uncorrected pairwise distance,
these regions were considered to define the putative MLR. This
method may also prove useful in detecting MLRs in the glycoproteins
of other RNA viruses.
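The thresholding step can be sketched as follows. This is a simplified illustration rather than the analysis as performed: it uses raw base counts pooled over the alignment instead of maximum likelihood base frequencies, it adds a small pseudocount (an assumption of this sketch) so that windows lacking uracil do not cause division by zero, and it uses the sample standard deviation. The function names are hypothetical. Only the C/U criterion is covered; the additional requirements that flagged windows fall within a coding region and show increased pairwise distance are not implemented here.

```python
import math
import statistics


def log10_cu_per_window(alignment, window=50, pseudocount=0.5):
    """log10 of the C/U count ratio in each nonoverlapping window.

    Counts are pooled over all sequences in the alignment; T is treated
    as U so that cDNA-style sequences can be used directly.
    """
    length = len(alignment[0])
    ratios = []
    for start in range(0, length, window):
        chunk = "".join(seq[start:start + window] for seq in alignment).upper()
        c = chunk.count("C")
        u = chunk.count("U") + chunk.count("T")
        ratios.append(math.log10((c + pseudocount) / (u + pseudocount)))
    return ratios


def high_cu_windows(ratios, z=1.96):
    """Indices of windows at least z standard deviations above the mean."""
    mean = statistics.mean(ratios)
    sd = statistics.stdev(ratios)  # sample SD; an assumption of this sketch
    cutoff = mean + z * sd
    return [i for i, r in enumerate(ratios) if r >= cutoff]
```

Runs of adjacent indices returned by `high_cu_windows` would then be candidates for a putative MLR, subject to the coding-region and divergence criteria described above.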
We then tested whether the high prevalence of cytosine within
the MLRs was actually due to the presence of serine, threonine,
and proline. We looked at what happened to the frequency of
cytosine within the MLRs when serine, threonine, and proline
were removed from the sequences (Fig. 2). Without these amino
acids, the cytosine frequency within the MLRs resembles that
of the rest of the glycoprotein. Therefore, the apparent skewed
nucleotide profile of these regions, which we suggest can be
used to identify MLRs at the primary sequence level, is directly
related to their function as MLRs.
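One way to picture this comparison is the sketch below, which computes the cytosine frequency of a coding sequence before and after dropping serine, threonine, and proline codons. It is an illustration under assumptions: the sequence is taken to be in frame and in the coding (mRNA) sense, removal is done codon by codon, and the function name is hypothetical.

```python
# Standard genetic code codons for serine, threonine, and proline (RNA alphabet).
SER = {"UCU", "UCC", "UCA", "UCG", "AGU", "AGC"}
THR = {"ACU", "ACC", "ACA", "ACG"}
PRO = {"CCU", "CCC", "CCA", "CCG"}
STP_CODONS = SER | THR | PRO


def cytosine_frequency(coding_seq, drop_stp=False):
    """Fraction of cytosine in an in-frame coding sequence.

    With drop_stp=True, Ser/Thr/Pro codons are removed before counting.
    Trailing nucleotides that do not complete a codon are ignored.
    Returns None if no nucleotides remain.
    """
    seq = coding_seq.upper().replace("T", "U")
    usable = len(seq) - len(seq) % 3
    codons = [seq[i:i + 3] for i in range(0, usable, 3)]
    if drop_stp:
        codons = [codon for codon in codons if codon not in STP_CODONS]
    kept = "".join(codons)
    return kept.count("C") / len(kept) if kept else None
```

Comparing `cytosine_frequency(seq)` with `cytosine_frequency(seq, drop_stp=True)` for an MLR and for the remainder of the glycoprotein gives the kind of contrast summarized in Fig. 2.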
To further characterize MLR evolution, we partitioned the genome
alignments into protein coding regions, intergenic regions,
and MLRs and then estimated the C/U ratios and uncorrected pairwise
distances of these regions. As expected, intergenic regions
had greater mean uncorrected pairwise distances than coding
regions (two-sample t test [P < 0.001]) (Table 2). An exception
was hMPV, which contains very short intergenic regions. This
analysis was not performed with CCHFV, because it does not contain
intergenic regions. In general, the MLRs exhibited uncorrected
pairwise distances that were similar to those of the intergenic
regions but greater than the uncorrected pairwise distances
of the coding regions.
MLRs had a C/U ratio greater than that of all other coding and
intergenic regions (Table 2). In hRSV and hMPV, there were no
significant differences between the C/U values for coding and
intergenic regions (two-sample t test [P = 0.604]); however,
for EBOV and MARV, the C/U value for the intergenic regions
was significantly lower than that for the coding regions (two-sample
t test [P < 0.001]). In summary, MLRs evolve at a rate indistinguishable
from that of intergenic RNA but, in the cases of EBOV and MARV,
have a base composition opposed to the mutational bias of the
rest of the genome. The question remains, what type of selection
pressures would cause such bizarre evolution?
To address this question, we estimated the strength and direction of selection at each codon for all five glycoprotein genes containing an MLR. We used a fixed-effects likelihood method in Datamonkey (www.datamonkey.org) to detect evidence of positive and negative (purifying) selection (α = 0.05) (5, 6). Glycoprotein sequences were obtained from GenBank (Table 1) and aligned using Se-Al (http://tree.bio.ed.ac.uk). These genes were screened for recombination using GENECONV (15), but no evidence of discordant phylogenies was observed. The section of the EBOV glycoprotein containing multiple reading frames was removed from the analysis (11, 22).
We found very strong evidence of relaxed purifying selection in MLRs. Within the glycoprotein genes of EBOV, CCHFV, hRSV, and hMPV, there were significant decreases in the numbers of negatively selected sites within the MLR compared to the rest of the gene (Fisher's exact test [P < 0.05]) (Table 3). The MLR of the MARV glycoprotein did not show a significant decrease in the number of sites under purifying selection, but this appears to have been the result of a low level of purifying selection in the non-MLR rather than the presence of strong constraints in the MLR. On the other hand, while a few positively selected sites were detected within many MLRs, only the glycoprotein MLR of CCHFV exhibited a significant increase in positive selection relative to the rest of the gene (Fisher's exact test [P < 0.0001]) (Table 3). Clearly, the rapid evolutionary change in viral MLRs cannot be explained by strong positive selection.
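The comparison of site counts between the MLR and the rest of the gene is a 2 × 2 contingency test, which can be sketched with the Python standard library as below. This is a generic two-sided Fisher's exact test, not a reproduction of the reported analysis: whether a one- or two-sided test was used is not stated above, the function name is hypothetical, and any counts passed to it in an example are placeholders rather than values from Table 3.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test P value for the table [[a, b], [c, d]].

    Suggested layout (an assumption of this sketch):
      a = selected sites in the MLR,      b = other sites in the MLR,
      c = selected sites in the non-MLR,  d = other sites in the non-MLR.
    The P value sums the hypergeometric probabilities of all tables with
    the same margins that are no more probable than the observed table.
    """
    row1, row2, col1 = a + b, c + d, a + c
    total_tables = comb(row1 + row2, col1)

    def prob(x):
        # Probability of x selected sites in row 1 given fixed margins.
        return comb(row1, x) * comb(row2, col1 - x) / total_tables

    p_observed = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    p_value = 0.0
    for x in range(lowest, highest + 1):
        p = prob(x)
        if p <= p_observed * (1 + 1e-9):  # tolerance for floating-point ties
            p_value += p
    return min(1.0, p_value)
```

For instance, `fisher_exact_two_sided(neg_mlr, other_mlr, neg_rest, other_rest)` with the per-gene counts of negatively selected and remaining codons would give a P value to compare against the 0.05 threshold.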
Natural selection is acting on MLRs, as evidenced by the maintenance
of O-linked glycosylation, the conservation of their length
in EBOV, and the increase in C/U ratios. Our results indicate
that this selection is not concentrated on any given codon but
rather is relaxed over the entire region. We propose that the
primary protein structure of an MLR is, by and large, not critical
to its function as a pathogenicity factor. So long as sufficient
O-linked glycosylation is preserved, the MLR will function in
spite of rapid sequence divergence. We found no evidence that
positive selection is strong enough to account for the rapid
evolution of MLRs. We propose that positive and purifying selection,
while present, are acting on the MLR as a whole and not on specific
amino acid residues, a molecular evolutionary pattern that stands
apart from almost every other described example.
The closest parallels that we know of to the evolutionary dynamics observed in these MLRs are those governing spider silk protein evolution. Spider silk proteins contain high proportions of repeating amino acid sequences that lead to an increased frequency of certain nucleotides and undergo rapid sequence divergence, even in the presence of purifying selection (4). Viral MLRs, however, are unique in that they experience relaxed selection in the absence of any repeat structure.
These MLRs are different from other mucins found in DNA viruses such as the channel catfish virus. This virus is a herpesvirus encoding a mucin that exhibits a repeat structure reminiscent of those of eukaryotic mucins. The channel catfish virus mucin may also be a pathogenicity factor, as a strain lacking this gene is attenuated (20, 21). The nonrepetitive viral MLRs described here may represent a novel way in which viruses can evolve pathogenicity factors. Why these MLRs are seen only in negative-stranded RNA viruses remains unclear.

ACKNOWLEDGMENTS
We thank Betsy Wertheim and Adam Bjork for helpful comments
on the manuscript. We also thank the anonymous reviewers for
their contributions.
This work was supported by the NSF-IGERT (NSF-Integrative Graduate Education and Research Traineeship) in Evolutionary, Functional, and Computational Genomics at the University of Arizona and the David and Lucile Packard Foundation.

FOOTNOTES
* Corresponding author. Mailing address: Department of Ecology and Evolutionary Biology, Biosciences West, 1041 E. Lowell St., University of Arizona, Tucson, AZ 85721. Phone: (520) 621-4881. Fax: (520) 621-9190. E-mail: wertheim{at}email.arizona.edu
Published ahead of print on 18 February 2009. 

REFERENCES
1 - Biacchesi, S., M. H. Skiadopoulos, G. Boivin, C. T. Hanson, B. R. Murphy, P. L. Collins, and U. J. Buchholz. 2003. Genetic diversity between human metapneumovirus subgroups. Virology 315:1-9.
2 - Deyde, V. M., M. L. Khristova, P. E. Rollin, T. G. Ksiazek, and S. T. Nichol. 2006. Crimean-Congo hemorrhagic fever virus genomics and global diversity. J. Virol. 80:8834-8842.
3 - Gerken, T. A., C. L. Owens, and M. Pasumarthy. 1997. Determination of the site-specific O-glycosylation pattern of the porcine submaxillary mucin tandem repeat glycopeptide. J. Biol. Chem. 272:9709-9719.
4 - Hayashi, C. Y., and R. V. Lewis. 2000. Molecular architecture and evolution of a modular spider silk protein gene. Science 287:1477-1479.
5 - Kosakovsky Pond, S. L., and S. D. Frost. 2005. Datamonkey: rapid detection of selective pressure on individual sites of codon alignments. Bioinformatics 21:2531-2533.
6 - Kosakovsky Pond, S. L., and S. D. Frost. 2005. Not so different after all: a comparison of methods for detecting amino acid sites under selection. Mol. Biol. Evol. 22:1208-1222.
7 - Lee, J. E., M. L. Fusco, A. J. Hessell, W. B. Oswald, D. R. Burton, and E. O. Saphire. 2008. Structure of the Ebola virus glycoprotein bound to an antibody from a human survivor. Nature 454:177-182.
8 - Peret, T. C., Y. Abed, L. J. Anderson, D. D. Erdman, and G. Boivin. 2004. Sequence polymorphism of the predicted human metapneumovirus G glycoprotein. J. Gen. Virol. 85:679-686.
9 - Posada, D., and K. A. Crandall. 1998. MODELTEST: testing the model of DNA substitution. Bioinformatics 14:817-818.
10 - Sanchez, A., T. G. Ksiazek, P. E. Rollin, M. E. Miranda, S. G. Trappier, A. S. Khan, C. J. Peters, and S. T. Nichol. 1999. Detection and molecular characterization of Ebola viruses causing disease in human and nonhuman primates. J. Infect. Dis. 179(Suppl. 1):S164-S169.
11 - Sanchez, A., S. G. Trappier, B. W. Mahy, C. J. Peters, and S. T. Nichol. 1996. The virion glycoproteins of Ebola viruses are encoded in two reading frames and are expressed through transcriptional editing. Proc. Natl. Acad. Sci. USA 93:3602-3607.
12 - Sanchez, A., S. G. Trappier, U. Stroher, S. T. Nichol, M. D. Bowen, and H. Feldmann. 1998. Variation in the glycoprotein and VP35 genes of Marburg virus strains. Virology 240:138-146.
13 - Sanchez, A. J., M. J. Vincent, B. R. Erickson, and S. T. Nichol. 2006. Crimean-Congo hemorrhagic fever virus glycoprotein precursor is cleaved by furin-like and SKI-1 proteases to generate a novel 38-kilodalton glycoprotein. J. Virol. 80:514-525.
14 - Sanchez, A. J., M. J. Vincent, and S. T. Nichol. 2002. Characterization of the glycoproteins of Crimean-Congo hemorrhagic fever virus. J. Virol. 76:7263-7275.
15 - Sawyer, S. 1989. Statistical tests for detecting gene conversion. Mol. Biol. Evol. 6:526-538.
16 - Sparer, T. E., S. Matthews, T. Hussell, A. J. Rae, B. Garcia-Barreno, J. A. Melero, and P. J. Openshaw. 1998. Eliminating a region of respiratory syncytial virus attachment protein allows induction of protective immunity without vaccine-enhanced lung eosinophilia. J. Exp. Med. 187:1921-1926.
17 - Swofford, D. 2002. PAUP*: phylogenetic analysis using parsimony (*and other methods), version 4.1. Sinauer Associates, Sunderland, MA.
18 - Takada, A., K. Fujioka, M. Tsuiji, A. Morikawa, N. Higashi, H. Ebihara, D. Kobasa, H. Feldmann, T. Irimura, and Y. Kawaoka. 2004. Human macrophage C-type lectin specific for galactose and N-acetylgalactosamine promotes filovirus entry. J. Virol. 78:2943-2947.
19 - Thompson, J. D., T. J. Gibson, F. Plewniak, F. Jeanmougin, and D. G. Higgins. 1997. The CLUSTAL_X windows interface: flexible strategies for multiple sequence alignment aided by quality analysis tools. Nucleic Acids Res. 25:4876-4882.
20 - Vanderheijden, N., P. Alard, C. Lecomte, and J. A. Martial. 1996. The attenuated V60 strain of channel catfish virus possesses a deletion in ORF50 coding for a potentially secreted glycoprotein. Virology 218:422-426.
21 - Vanderheijden, N., L. A. Hanson, E. Thiry, and J. A. Martial. 1999. Channel catfish virus gene 50 encodes a secreted, mucin-like glycoprotein. Virology 257:220-227.
22 - Volchkov, V. E., S. Becker, V. A. Volchkova, V. A. Ternovoj, A. N. Kotov, S. V. Netesov, and H. D. Klenk. 1995. GP mRNA of Ebola virus is edited by the Ebola virus polymerase and by T7 and vaccinia virus polymerases. Virology 214:421-430.
23 - Woelk, C. H., and E. C. Holmes. 2001. Variable immune-driven natural selection in the attachment (G) glycoprotein of respiratory syncytial virus (RSV). J. Mol. Evol. 52:182-192.
24 - Yang, Z. Y., H. J. Duckers, N. J. Sullivan, A. Sanchez, E. G. Nabel, and G. J. Nabel. 2000. Identification of the Ebola virus glycoprotein as the main viral determinant of vascular cell cytotoxicity and injury. Nat. Med. 6:886-889.
25 - Zlateva, K. T., P. Lemey, E. Moes, A. M. Vandamme, and M. Van Ranst. 2005. Genetic variability and molecular evolution of the human respiratory syncytial virus subgroup B attachment G protein. J. Virol. 79:9157-9167.